Information-Theoretic Properties of
Languages and their Grammars
Bruce J. MacLennan
Computer Science Department
Naval Postgraduate School
Monterey, CA 93943

Abstract: We describe means for computing a number of information-theoretic properties
of languages and their grammars. For example, the entropy of a system of symbols is
widely recognized as a measure of that system's complexity and organization. We show
how the entropy of a language can be computed in a simple way from a grammar annotated
with production probabilities. We then develop means for statistically estimating these
production probabilities from measurable properties of strings in the language. We also
consider the computation of other information-theoretic properties of languages and
grammars, such as the average information borne by a symbol in a language and the
average information used by the productions of a grammar.



1. Introduction 

The entropy of a system is widely recognized as a measure (actually, a reciprocal 
measure) of that system’s organization and structure [Shannon, Brillouin, Hamming, 
McKay, Cherry]. This suggests that the entropy of a language might be an important 
property to measure to form a basis for the quantitative comparison of languages. For 
this reason we have developed means for computing the entropies of languages. 
Specifically, we derive formulas for computing the entropy of a language from a gram- 
mar for that language that has been annotated with the probabilities of its productions 
being applied. We also show techniques whereby these production probabilities can be 
inferred from statistical properties of strings in the language. Finally, we apply the 
same techniques to several related issues, such as determining the average derivation 
length of a grammar, and the average information consumed by a grammar during 
string generation. These all seem to show promise as a means for making quantitative
language comparisons.

2. Entropy of a Language
2.1 Definition of Entropy

Suppose $\Sigma$ is a finite system of symbols in which symbol $\sigma_i$ has an a priori
probability of occurrence $p_i$. Naturally, $\sum_{i=1}^{k} p_i = 1$. The entropy of
$\Sigma$ is defined

$$H(\Sigma) = \sum_{i=1}^{k} p_i \lg(1/p_i),$$

where $\lg x = \log_2 x$. Since the entropy does not depend on the symbols $\sigma_i$, and is
completely determined by the probabilities $p_i$, it is simpler to define the entropy in
terms of the a priori probability distribution. The entropy of the finite discrete
probability distribution $p_1, p_2, \ldots, p_k$ is

$$H(p_1, p_2, \ldots, p_k) = -\sum_{i=1}^{k} p_i \lg p_i.$$
The preceding ideas are easily extended to infinite discrete probability distributions.
Suppose $\sum_{i=1}^{\infty} p_i = 1$. We define the entropy of this distribution:

$$H(p_1, p_2, \ldots) = \sum_{i=1}^{\infty} p_i \lg(1/p_i) = -\sum_{i=1}^{\infty} p_i \lg p_i.$$

Note that $\sum_i p_i = 1$ does not guarantee the convergence of $\sum_i p_i \lg p_i$. That
is, there are probability distributions that do not have an entropy. Take, for example,
$p_i = C/(i \ln^2 i)$ (for $i \ge 2$, with $C$ the normalizing constant). The sum
$\sum_i p_i$ converges, but the entropy does not. Fortunately, these troublesome
distributions do not seem to occur in practice.

Entropy is widely recognized as a measure of disorganization, and thus lack of structure
[Brillouin]. When organization increases, entropy decreases; when entropy increases,
structure decreases. Thus it is usually more convenient to work with negentropy rather
than entropy. The negentropy of a system is simply the negative of its entropy. Thus,
when organization increases, so does negentropy; when negentropy decreases, so does
structure. The negentropy H of a discrete distribution is defined

$$H(p_1, p_2, \ldots) = \sum_i p_i \lg p_i.$$

The formulas in the following sections use H in this negentropy sense; the corresponding
entropy is its negative.

2.2 The Entropy of a Language
A language $\Sigma$ is a (usually infinite) set of strings

$$\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_i, \ldots\}.$$

Now, let $P(\sigma_i)$ be the a priori probability of occurrence of the string $\sigma_i$
in $\Sigma$. The negentropy of the language $\Sigma$ is simply

$$H(\Sigma) = \sum_i P(\sigma_i) \lg P(\sigma_i).$$

In most interesting cases the number of strings in a language is infinite. The entropy of
an infinite language is thus defined in terms of an infinite set of probabilities.
Therefore for most languages we are able to calculate the entropy only when there is some
finite description for that infinite set of probabilities, that is, when there is some
structure in that infinite set of probabilities.

Although useful languages are usually infinite (i.e., comprise an infinite number of
strings), they can be described finitely by a grammar. That is, the grammar reflects
the finite structure in the infinite set of strings. This suggests a solution to the
problem of finding a finite description of the infinity of probabilities associated with
the strings in the language.

The generation of each string in a language requires a finite number of elementary
choices to be made. For example, in a grammar for arithmetic expressions there
might be two productions for a nonterminal $\nu$:

$$\nu \to +$$

$$\nu \to -$$

In deriving a string from this grammar, the symbol $\nu$ can be replaced by either '+' or
'−'; a choice must be made. Thus, a finite sequence of choices
$\pi_1, \pi_2, \ldots, \pi_k$ is necessary to determine each string in the language.
Conversely, if the grammar is unambiguous, there is a unique such sequence for each
string in the language (see footnote 1).

Now suppose that each elementary choice $\pi_i$ has an a priori probability $P(\pi_i)$ of
being made. If these probabilities are independent, then the probability of the
resulting string being generated is

$$P(\pi_1) P(\pi_2) \cdots P(\pi_k).$$

Thus, associating a probability with each elementary choice permitted by the grammar
induces a probability on each string generated by the grammar. We call a grammar
with such associated probabilities an annotated grammar.
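
As a small illustration of how choice probabilities induce string probabilities, the following sketch multiplies the probabilities of an independent choice sequence. The choice labels and the numbers in `choice_prob` are hypothetical and are not taken from the paper.

```python
from math import prod

# Hypothetical annotated choices; labels and probabilities are assumptions
# made for illustration only.
choice_prob = {
    "sign:+": 0.3, "sign:-": 0.2, "sign:none": 0.5,
    "more:yes": 0.6, "more:no": 0.4,
}

def string_probability(choices):
    """Probability of the string determined by a sequence of independent choices."""
    return prod(choice_prob[c] for c in choices)

# A minus sign, then two decisions to continue and one decision to stop.
p = string_probability(["sign:-", "more:yes", "more:yes", "more:no"])
```

Under an unambiguous grammar each string corresponds to exactly one such sequence, so this product is the probability of the string itself.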

There is of course no guarantee that the probabilities induced by an annotated
grammar are in fact the a priori occurrence probabilities of the strings in the
language. Indeed, an annotated grammar is a model of the processes that in reality
generate strings in the language. As such, it might or might not be a good model.

We say that an annotated grammar predicts a language if it generates that language
and induces on its strings their actual a priori probabilities of occurrence. We call a
language predictable if there is an annotated grammar that predicts it. Clearly then,
we can determine the probabilities of the strings in a predictable language if we can
find an annotated grammar that predicts that language. Further, if we can calculate
the entropy of the language generated by an annotated grammar, then we will be able
to calculate the entropy of the predictable language. In the following sections we
develop means for computing entropies from annotated grammars.

Footnote 1: More precisely, in an unambiguous grammar there is a unique leftmost
derivation for each string.

3. Computing Negentropy from Grammars
3.1 Annotated Regular Grammars

We begin our analysis with a particularly simple class of languages: regular languages
[Ginsburg, Hopcroft & Ullman]. The advantage of beginning with them is that the
grammar for a regular language can be written as a single nonrecursive production
making use of only a few simple operators. These operators are:

    name            notation         interpretation

    catenation      $AB$             an A followed by a B
    alternation     $A \oplus B$     an A or a B
    Kleene star     $A^*$            zero or more As
    Kleene cross    $A^+$            one or more As

Any regular language can be described by an expression formed from the empty string
($\epsilon$), individual tokens and these operators, appropriately parenthesized (see
footnote 2). Such an expression, which defines a regular language, is called a regular
expression.



For example, the regular language of signed, nonnull digit strings is defined by the
regular expression:

$$(+ \oplus - \oplus \epsilon)(0 \oplus 1 \oplus 2 \oplus 3 \oplus 4 \oplus 5 \oplus 6 \oplus 7 \oplus 8 \oplus 9)^+.$$

This expression can be read, “a plus or a minus or nothing, followed by a string of one
or more digits.”



As discussed in Section 2, to compute the entropy of a language it is necessary to 
know the probabilities of the strings in that language. If we have an annotated gram- 
mar that predicts the language that it generates, then the probabilities of these strings 
can be computed from the production probabilities (choice probabilities) in the gram- 
mar. 



In deriving a string from a regular expression there is only one situation in which a
choice can be made: from $A \oplus B$ we can derive either an A or a B. Thus we can
annotate a regular expression by associating probabilities with all the alternands of an
alternation. We write the probabilities immediately preceding the alternands that they
are associated with:

$$pA \oplus \bar{p}B.$$

This means that we can choose an A with probability $p$, or a B with probability
$\bar{p}$. (Here, and throughout this paper, we use $\bar{p}$ as an abbreviation for
$1-p$.)

Since one of the alternands must be chosen, their probabilities must add to unity.
This is the case above, since $p + \bar{p} = 1$. It also applies if there are more than
two alternands. For example, if we have

$$p_1 A_1 \oplus p_2 A_2 \oplus \cdots \oplus p_n A_n,$$

then we must have $p_1 + p_2 + \cdots + p_n = 1$.

The following sections develop entropy formulas that can be recursively applied to
any annotated regular expression to compute the entropy of the regular language
predicted by that expression. When these results have been obtained we will show that
they can be easily extended to the computation of the entropy of any context-free
language from a grammar that predicts it.

Footnote 2: We have used $\oplus$ instead of the usual |, since the latter could be
confused with conditional probabilities, conditional entropies, etc. Readers unfamiliar
with the regular languages and other concepts from formal language theory should consult
any standard text on the subject (e.g., Ginsburg or Hopcroft & Ullman).

3.2 Entropy Formulas 

We derive a series of formulas that can be applied recursively to an annotated regular 
expression to compute the entropy of the language predicted by that expression. In all 
cases we assume that the regular expression is an unambiguous grammar (i.e., there is 
only one way to generate a given string), and that the choices leading to a given string 
are independent. 

We begin with the simplest regular expressions, the empty string and individual 
tokens, and proceed to the catenation and alternation operations. 

Theorem: If $\epsilon$ is the set containing just the empty string and $\tau$ the set
containing just the individual symbol $\tau$ then

$$H(\epsilon) = H(\tau) = 0.$$

Proof: By definition $L(\epsilon) = \{\epsilon\}$ and $L(\tau) = \{\tau\}$. Since there is
only one string in each of these languages, its a priori probability is 1. Hence,

$$H(\epsilon) = H(\tau) = 1 \lg 1 = 0.$$

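Theorem: $H(AB) = H(A) + H(B)$.

Proof: Suppose

$$L(A) = \{\alpha_1, \alpha_2, \ldots\},$$

$$L(B) = \{\beta_1, \beta_2, \ldots\}.$$

Then

$$L(AB) = \{\alpha_i \beta_j \mid \alpha_i \in L(A),\ \beta_j \in L(B)\}.$$

Let $P_A(\alpha_i)$ be the probability of choosing $\alpha_i$ from $L(A)$ and
$P_B(\beta_j)$ the probability of choosing $\beta_j$ from $L(B)$. Then, since we are
assuming these choices are independent, the probability of choosing $\alpha_i\beta_j$
from $L(AB)$, $P_{AB}(\alpha_i\beta_j)$, is just $P_A(\alpha_i) P_B(\beta_j)$. Now, let
$p_i = P_A(\alpha_i)$ and $q_j = P_B(\beta_j)$. Then, by factoring and distributing:

$$\begin{aligned}
H(AB) &= \sum_{i,j} P_{AB}(\alpha_i\beta_j) \lg P_{AB}(\alpha_i\beta_j) \\
&= \sum_{i,j} P_A(\alpha_i) P_B(\beta_j) \lg [P_A(\alpha_i) P_B(\beta_j)] \\
&= \sum_i \sum_j p_i q_j \lg (p_i q_j) \\
&= \sum_i p_i \Big[ \sum_j q_j \lg (p_i q_j) \Big] \\
&= \sum_i p_i \Big[ \sum_j q_j (\lg p_i + \lg q_j) \Big] \\
&= \sum_i p_i \Big[ \sum_j q_j \lg p_i + \sum_j q_j \lg q_j \Big].
\end{aligned}$$

Now, since $\sum_j q_j = 1$ and $H(B) = \sum_j q_j \lg q_j$,

$$\begin{aligned}
H(AB) &= \sum_i p_i [\lg p_i + H(B)] \\
&= \sum_i p_i \lg p_i + \sum_i p_i H(B).
\end{aligned}$$

Since $\sum_i p_i = 1$ and $H(A) = \sum_i p_i \lg p_i$, we have

$$H(AB) = H(A) + H(B).$$

Q.E.D.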

Definition: The n-fold catenation of A with itself, $A^n$, is defined:

$$A^0 = \epsilon,$$

$$A^1 = A,$$

$$A^{n+1} = A A^n, \text{ for } n > 0.$$

Corollary: $H(A^n) = nH(A)$.

Proof: We prove the result inductively.

$$H(A^0) = H(\epsilon) = 0 = 0 \cdot H(A).$$

Similarly,

$$H(A^1) = H(A) = 1 \cdot H(A).$$

Proceeding inductively for $n > 0$,

$$\begin{aligned}
H(A^{n+1}) &= H(A A^n) \\
&= H(A) + H(A^n) \\
&= H(A) + nH(A) \\
&= (n+1)H(A).
\end{aligned}$$

Q.E.D.

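Theorem: $H[pA \oplus \bar{p}B] = H(p,\bar{p}) + pH(A) + \bar{p}H(B)$.

Proof: Let $G = pA \oplus \bar{p}B$. Then, to generate a string in $L(G)$ we must make a
choice; with probability $p$ we pick a string from $L(A)$, with probability $\bar{p}$ we
pick a string from $L(B)$. Let $\sigma$ be a string in $L(G)$. Since we are only
considering unambiguous grammars, $\sigma$ must have come from either $L(A)$ or $L(B)$.
Suppose that $\sigma \in L(A)$. Since the probability of a selection from $L(A)$ is $p$,
and the probability of getting $\sigma$ when a selection is made from $L(A)$ is
$P_A(\sigma)$, the probability $P_G(\sigma)$ of selecting $\sigma$ from $L(G)$ is
$pP_A(\sigma)$. Similarly, if $\sigma \in L(B)$ then $P_G(\sigma) = \bar{p}P_B(\sigma)$.
These observations allow us to compute the entropy of G. As before, write
$p_i = P_A(\alpha_i)$ and $q_j = P_B(\beta_j)$.

$$\begin{aligned}
H(G) &= \sum_{\sigma \in L(G)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_{\sigma \in L(A)} P_G(\sigma) \lg P_G(\sigma) + \sum_{\sigma \in L(B)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_i P_G(\alpha_i) \lg P_G(\alpha_i) + \sum_j P_G(\beta_j) \lg P_G(\beta_j) \\
&= \sum_i pP_A(\alpha_i) \lg pP_A(\alpha_i) + \sum_j \bar{p}P_B(\beta_j) \lg \bar{p}P_B(\beta_j) \\
&= \sum_i p p_i \lg p p_i + \sum_j \bar{p} q_j \lg \bar{p} q_j \\
&= p \sum_i p_i \lg p p_i + \bar{p} \sum_j q_j \lg \bar{p} q_j \\
&= p \sum_i p_i (\lg p + \lg p_i) + \bar{p} \sum_j q_j (\lg \bar{p} + \lg q_j) \\
&= p \Big[ \sum_i p_i \lg p + \sum_i p_i \lg p_i \Big] + \bar{p} \Big[ \sum_j q_j \lg \bar{p} + \sum_j q_j \lg q_j \Big].
\end{aligned}$$

From the definitions of $H(A)$ and $H(B)$ and the fact that the $p_i$ and $q_j$ sum to 1 we
get

$$\begin{aligned}
H(G) &= p[\lg p + H(A)] + \bar{p}[\lg \bar{p} + H(B)] \\
&= p \lg p + \bar{p} \lg \bar{p} + pH(A) + \bar{p}H(B) \\
&= H(p,\bar{p}) + pH(A) + \bar{p}H(B).
\end{aligned}$$

Q.E.D.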

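This result is easy to generalize to the n-fold alternation:

Theorem: The negentropy of an n-fold alternation can be computed:

$$H[p_1 A_1 \oplus p_2 A_2 \oplus \cdots \oplus p_n A_n] = H(p_1, p_2, \ldots, p_n) + \sum_{j=1}^{n} p_j H(A_j).$$

Proof: Let $G = p_1 A_1 \oplus p_2 A_2 \oplus \cdots \oplus p_n A_n$. For each
$\alpha_{i,j} \in L(A_j)$ let $q_{i,j} = P_{A_j}(\alpha_{i,j})$. The proof is a simple
generalization of the previous:

$$\begin{aligned}
H(G) &= \sum_{\sigma \in L(G)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_{j=1}^{n} \sum_{\sigma \in L(A_j)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_{j=1}^{n} \sum_i P_G(\alpha_{i,j}) \lg P_G(\alpha_{i,j}) \\
&= \sum_{j=1}^{n} \sum_i p_j q_{i,j} \lg p_j q_{i,j} \\
&= \sum_{j=1}^{n} p_j \Big[ \sum_i q_{i,j} (\lg p_j + \lg q_{i,j}) \Big] \\
&= \sum_{j=1}^{n} p_j \Big[ \sum_i q_{i,j} \lg p_j + \sum_i q_{i,j} \lg q_{i,j} \Big].
\end{aligned}$$

Since $\sum_i q_{i,j} = 1$, $H(p_1, p_2, \ldots, p_n) = \sum_{j=1}^{n} p_j \lg p_j$ and
$H(A_j) = \sum_i q_{i,j} \lg q_{i,j}$,

$$\begin{aligned}
H(G) &= \sum_{j=1}^{n} p_j [\lg p_j + H(A_j)] \\
&= \sum_{j=1}^{n} p_j \lg p_j + \sum_{j=1}^{n} p_j H(A_j) \\
&= H(p_1, p_2, \ldots, p_n) + \sum_{j=1}^{n} p_j H(A_j).
\end{aligned}$$

Q.E.D.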

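The $\oplus$ operation is associative, that is,

$$A \oplus (B \oplus C) = A \oplus B \oplus C = (A \oplus B) \oplus C.$$

We would expect the annotated version of this operation to also be associative:

$$pA \oplus \bar{p}(qB \oplus \bar{q}C) = pA \oplus \bar{p}qB \oplus \bar{p}\bar{q}C.$$

Thus, if our negentropy formula is correct, we should get the same value for the
negentropy of each of these regular expressions.

Theorem: $H[pA \oplus \bar{p}(qB \oplus \bar{q}C)] = H[pA \oplus \bar{p}qB \oplus \bar{p}\bar{q}C]$.

Proof: We derive the negentropy of the right-hand expression:

$$H[pA \oplus \bar{p}qB \oplus \bar{p}\bar{q}C] = H(p, \bar{p}q, \bar{p}\bar{q}) + pH(A) + \bar{p}qH(B) + \bar{p}\bar{q}H(C).$$

Next we derive the negentropy of the left-hand side and show it equals the expression
above:

$$\begin{aligned}
H[pA \oplus \bar{p}(qB \oplus \bar{q}C)]
&= H(p,\bar{p}) + pH(A) + \bar{p}H[qB \oplus \bar{q}C] \\
&= H(p,\bar{p}) + pH(A) + \bar{p}[H(q,\bar{q}) + qH(B) + \bar{q}H(C)] \\
&= H(p,\bar{p}) + pH(A) + \bar{p}H(q,\bar{q}) + \bar{p}qH(B) + \bar{p}\bar{q}H(C) \\
&= H(p,\bar{p}) + \bar{p}H(q,\bar{q}) + pH(A) + \bar{p}qH(B) + \bar{p}\bar{q}H(C).
\end{aligned}$$

Thus it remains to show that

$$H(p, \bar{p}q, \bar{p}\bar{q}) = H(p,\bar{p}) + \bar{p}H(q,\bar{q}).$$

Expanding the right-hand side above, rearranging, and recalling that $q + \bar{q} = 1$, we
get

$$\begin{aligned}
H(p,\bar{p}) + \bar{p}H(q,\bar{q})
&= p \lg p + \bar{p} \lg \bar{p} + \bar{p}q \lg q + \bar{p}\bar{q} \lg \bar{q} \\
&= p \lg p + \bar{p}(q + \bar{q}) \lg \bar{p} + \bar{p}q \lg q + \bar{p}\bar{q} \lg \bar{q} \\
&= p \lg p + \bar{p}q \lg \bar{p} + \bar{p}\bar{q} \lg \bar{p} + \bar{p}q \lg q + \bar{p}\bar{q} \lg \bar{q} \\
&= p \lg p + \bar{p}q (\lg \bar{p} + \lg q) + \bar{p}\bar{q} (\lg \bar{p} + \lg \bar{q}) \\
&= p \lg p + \bar{p}q \lg \bar{p}q + \bar{p}\bar{q} \lg \bar{p}\bar{q} \\
&= H(p, \bar{p}q, \bar{p}\bar{q}).
\end{aligned}$$

Q.E.D.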

We now consider the iterative constructs in regular grammars. The Kleene cross,
$A^+$, means one or more As. Thus $A^+$ can be expanded as the infinite alternation:

$$A^+ = A \oplus A^2 \oplus A^3 \oplus \cdots.$$

It can also be defined by the recursive formula:

$$A^+ = A \oplus AA^+.$$

This kind of regular grammar is converted to an annotated grammar by adding a
continuation probability $p$:

$$A^{p+} = \bar{p}A \oplus p\bar{p}A^2 \oplus p^2\bar{p}A^3 \oplus \cdots$$

or in its recursive form

$$A^{p+} = \bar{p}A \oplus pAA^{p+}.$$

We will derive the negentropy formula two ways, using both the infinite alternation and
recursive definitions, and show that we get the same result.

Theorem: $H[A^{p+}] = [H(p,\bar{p}) + H(A)]/\bar{p}$.

Proof: First we use the recursive formulation:

$$A^{p+} = \bar{p}A \oplus pAA^{p+}.$$

Taking the negentropy of both sides we have:

$$\begin{aligned}
H[A^{p+}] &= H[\bar{p}A \oplus pAA^{p+}] \\
&= H(p,\bar{p}) + \bar{p}H(A) + pH[AA^{p+}] \\
&= H(p,\bar{p}) + \bar{p}H(A) + p[H(A) + H[A^{p+}]] \\
&= H(p,\bar{p}) + \bar{p}H(A) + pH(A) + pH[A^{p+}] \\
&= H(p,\bar{p}) + H(A) + pH[A^{p+}].
\end{aligned}$$

Solving now for $H[A^{p+}]$:

$$(1-p)H[A^{p+}] = H(p,\bar{p}) + H(A).$$

Hence,

$$H[A^{p+}] = [H(p,\bar{p}) + H(A)]/\bar{p}.$$

Q.E.D.

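Next we compute the negentropy directly from the infinite expansion of the iteration:

$$\begin{aligned}
H[A^{p+}] &= H[\bar{p}A \oplus p\bar{p}A^2 \oplus p^2\bar{p}A^3 \oplus \cdots] \\
&= H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}H(A) + p\bar{p}H(A^2) + p^2\bar{p}H(A^3) + \cdots.
\end{aligned}$$

Recalling that $H(A^n) = nH(A)$,

$$\begin{aligned}
H[A^{p+}] &= H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}[H(A) + 2pH(A) + 3p^2H(A) + \cdots] \\
&= H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}[1 + 2p + 3p^2 + \cdots]H(A).
\end{aligned}$$

Now note that if $|p| < 1$ the power series expansion of $1/\bar{p}^2$ is

$$\frac{1}{\bar{p}^2} = \frac{1}{(1-p)^2} = 1 + 2p + 3p^2 + \cdots.$$

Therefore

$$\begin{aligned}
H[A^{p+}] &= H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + \bar{p}(1/\bar{p}^2)H(A) \\
&= H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) + H(A)/\bar{p}.
\end{aligned}$$

It remains to simplify $H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots)$.

$$\begin{aligned}
H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots)
&= \bar{p} \lg \bar{p} + p\bar{p} \lg p\bar{p} + p^2\bar{p} \lg p^2\bar{p} + \cdots \\
&= \sum_{k=0}^{\infty} p^k\bar{p} \lg p^k\bar{p} \\
&= \bar{p} \sum_{k=0}^{\infty} p^k \lg p^k\bar{p} \\
&= \bar{p} \sum_{k=0}^{\infty} p^k [\lg p^k + \lg \bar{p}] \\
&= \bar{p} \Big[ \sum_{k=0}^{\infty} p^k \lg p^k + \sum_{k=0}^{\infty} p^k \lg \bar{p} \Big].
\end{aligned}$$

Now note that if $|p| < 1$ the power series expansion of $1/\bar{p}$ is

$$\frac{1}{\bar{p}} = \frac{1}{1-p} = 1 + p + p^2 + p^3 + \cdots = \sum_{k=0}^{\infty} p^k.$$

Therefore,

$$\begin{aligned}
H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots)
&= \bar{p} \Big[ \sum_{k=0}^{\infty} k p^k \lg p + (1/\bar{p}) \lg \bar{p} \Big] \\
&= \Big[ \bar{p} \lg p \sum_{k=0}^{\infty} k p^k \Big] + \lg \bar{p} \\
&= \Big[ p\bar{p} \lg p \sum_{k=1}^{\infty} k p^{k-1} \Big] + \lg \bar{p}.
\end{aligned}$$

Using again the power series expansion for $1/\bar{p}^2$ we have

$$\begin{aligned}
H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots)
&= p\bar{p} \lg p \,(1/\bar{p}^2) + \lg \bar{p} \\
&= (p \lg p)/\bar{p} + \lg \bar{p} \\
&= (p \lg p + \bar{p} \lg \bar{p})/\bar{p} \\
&= H(p,\bar{p})/\bar{p}.
\end{aligned}$$

Therefore,

$$H[A^{p+}] = H(p,\bar{p})/\bar{p} + H(A)/\bar{p} = [H(p,\bar{p}) + H(A)]/\bar{p}.$$

Q.E.D.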
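
The identity used in this proof, that the geometric distribution with weights $\bar{p}p^k$ has negentropy $H(p,\bar{p})/\bar{p}$, can be checked numerically. The sketch below is only a sanity check under the assumption that truncating the infinite sum after a few thousand terms is accurate enough; the function names are illustrative.

```python
from math import log2

def negentropy(probs):
    """Negentropy sum(p * lg p) of a discrete distribution; 0 * lg 0 is taken as 0."""
    return sum(p * log2(p) for p in probs if p > 0)

def geometric_negentropy(p, terms=2000):
    """Truncated negentropy of the distribution (1 - p) * p**k for k = 0, 1, 2, ..."""
    q = 1.0 - p
    return negentropy(q * p**k for k in range(terms))

def closed_form(p):
    """The closed form H(p, 1 - p) / (1 - p) derived in the text."""
    q = 1.0 - p
    return negentropy([p, q]) / q

for p in (0.1, 0.5, 0.9):
    print(p, geometric_negentropy(p), closed_form(p))
```

The two columns should agree to within floating-point error for continuation probabilities that are not too close to 1; nearer to 1 the truncation would need more terms.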

The Kleene star, $A^*$, means zero or more repetitions. Thus it can be defined by the
infinite expansion

$$A^* = A^0 \oplus A^1 \oplus A^2 \oplus \cdots,$$

where $A^0 = \epsilon$ and $A^1 = A$. Since the expression following the first $\oplus$ is
just the definition of $A^+$, the above equation can be written

$$A^* = \epsilon \oplus A^+.$$

The Kleene star can also be defined recursively:

$$A^* = \epsilon \oplus AA^*.$$

This notation is annotated by attaching a continuation probability $p$ to the star:

$$A^{p*} = \bar{p}\epsilon \oplus pAA^{p*}.$$

The following theorem defines its negentropy.

Theorem: $H[A^{p*}] = [H(p,\bar{p}) + pH(A)]/\bar{p}$.

Proof: There are several ways to prove this result, corresponding to the alternate
definitions of $A^*$.

(1) First we derive the negentropy of $A^{p*}$ from the negentropy of $A^{p+}$. Since

$$A^{p*} = \bar{p}\epsilon \oplus pA^{p+},$$

we can apply H to both sides:

$$\begin{aligned}
H[A^{p*}] &= H[\bar{p}\epsilon \oplus pA^{p+}] \\
&= H(p,\bar{p}) + \bar{p}H(\epsilon) + pH[A^{p+}] \\
&= H(p,\bar{p}) + p[H(p,\bar{p}) + H(A)]/\bar{p} \\
&= [\bar{p}H(p,\bar{p}) + pH(p,\bar{p}) + pH(A)]/\bar{p} \\
&= [H(p,\bar{p}) + pH(A)]/\bar{p}.
\end{aligned}$$

Q.E.D.

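(2) We can also compute the negentropy from the recursive equation

$$A^{p*} = \bar{p}\epsilon \oplus pAA^{p*}.$$

We apply H to both sides and get

$$\begin{aligned}
H[A^{p*}] &= H[\bar{p}\epsilon \oplus pAA^{p*}] \\
&= H(p,\bar{p}) + \bar{p}H(\epsilon) + pH[AA^{p*}] \\
&= H(p,\bar{p}) + \bar{p} \cdot 0 + p[H(A) + H[A^{p*}]] \\
&= H(p,\bar{p}) + pH(A) + pH[A^{p*}].
\end{aligned}$$

Grouping the unknowns on the left produces

$$(1-p)H[A^{p*}] = H(p,\bar{p}) + pH(A).$$

Recalling that $\bar{p} = 1-p$ we have

$$H[A^{p*}] = [H(p,\bar{p}) + pH(A)]/\bar{p}.$$

Q.E.D.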

(3) Finally, we derive the negentropy formula from the infinite expansion

$$A^{p*} = \bar{p}\epsilon \oplus p\bar{p}A \oplus p^2\bar{p}A^2 \oplus \cdots.$$

Take the negentropy of both sides to get

$$\begin{aligned}
H[A^{p*}] &= H[\bar{p}\epsilon \oplus p\bar{p}A \oplus p^2\bar{p}A^2 \oplus p^3\bar{p}A^3 \oplus \cdots] \\
&= H(\bar{p}, p\bar{p}, p^2\bar{p}, p^3\bar{p}, \ldots) + p\bar{p}H(A) + p^2\bar{p}H(A^2) + p^3\bar{p}H(A^3) + \cdots \\
&= H(\bar{p}, p\bar{p}, p^2\bar{p}, p^3\bar{p}, \ldots) + p\bar{p}(1 + 2p + 3p^2 + \cdots)H(A) \\
&= H(\bar{p}, p\bar{p}, p^2\bar{p}, p^3\bar{p}, \ldots) + (p/\bar{p})H(A),
\end{aligned}$$

where we have used $1/\bar{p}^2 = 1 + 2p + 3p^2 + \cdots$. We have already shown in the
derivation of $H[A^{p+}]$ that

$$H(\bar{p}, p\bar{p}, p^2\bar{p}, \ldots) = H(p,\bar{p})/\bar{p}.$$

Therefore we have

$$\begin{aligned}
H[A^{p*}] &= H(p,\bar{p})/\bar{p} + (p/\bar{p})H(A) \\
&= [H(p,\bar{p}) + pH(A)]/\bar{p}.
\end{aligned}$$

Q.E.D.

We can check these results by computing the negentropy of $A^{p+}$ based on the equation:

$$A^{p+} = AA^{p*}.$$

Applying H to both sides we derive

$$\begin{aligned}
H[A^{p+}] &= H[AA^{p*}] \\
&= H(A) + H[A^{p*}] \\
&= H(A) + [H(p,\bar{p}) + pH(A)]/\bar{p} \\
&= [H(p,\bar{p}) + \bar{p}H(A) + pH(A)]/\bar{p} \\
&= [H(p,\bar{p}) + H(A)]/\bar{p},
\end{aligned}$$

which checks with our previous result.

The formulas for computing the negentropy (and hence entropy) of a regular
language are summarized in Table 1.

TABLE 1. Formulas for Negentropy of Regular Languages

    $H[\epsilon]$                  =    $0$
    $H[\tau]$                      =    $0$
    $H[AB]$                        =    $H(A) + H(B)$
    $H[pA \oplus \bar{p}B]$        =    $H(p,\bar{p}) + pH(A) + \bar{p}H(B)$
    $H[A^{p+}]$                    =    $[H(p,\bar{p}) + H(A)]/\bar{p}$
    $H[A^{p*}]$                    =    $[H(p,\bar{p}) + pH(A)]/\bar{p}$
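
The formulas of Table 1 can be applied mechanically by recursion over the structure of an annotated regular expression. The following is a minimal sketch of such an evaluator; the tuple encoding of expressions and the example probabilities are assumptions made for illustration, and the expression is assumed to be unambiguous, as the theorems require.

```python
from math import log2

def h(*probs):
    """Negentropy H(p1, ..., pn) = sum p_i lg p_i, with 0 lg 0 taken as 0."""
    return sum(p * log2(p) for p in probs if p > 0)

def negentropy(expr):
    """Negentropy of the language predicted by an annotated regular expression.

    Expression encoding (an assumption of this sketch):
      ("eps",)                          empty string
      ("tok", name)                     single token
      ("cat", e1, e2, ...)              catenation
      ("alt", (p1, e1), (p2, e2), ...)  annotated alternation; the p_i sum to 1
      ("plus", p, e)                    Kleene cross, continuation probability p
      ("star", p, e)                    Kleene star, continuation probability p
    """
    kind = expr[0]
    if kind in ("eps", "tok"):
        return 0.0
    if kind == "cat":
        return sum(negentropy(e) for e in expr[1:])
    if kind == "alt":
        branches = expr[1:]
        return h(*(p for p, _ in branches)) + sum(p * negentropy(e) for p, e in branches)
    if kind == "plus":
        _, p, e = expr
        return (h(p, 1 - p) + negentropy(e)) / (1 - p)
    if kind == "star":
        _, p, e = expr
        return (h(p, 1 - p) + p * negentropy(e)) / (1 - p)
    raise ValueError(f"unknown expression kind: {kind}")

# Signed digit strings: (+ or - or nothing) followed by one or more digits.
# All probabilities below are hypothetical.
digits = ("alt", *[(0.1, ("tok", str(d))) for d in range(10)])
signed = ("cat",
          ("alt", (0.25, ("tok", "+")), (0.25, ("tok", "-")), (0.5, ("eps",))),
          ("plus", 0.5, digits))
entropy_bits = -negentropy(signed)
```

Because every case is a direct transcription of one row of the table, the same function can also be used to spot-check identities such as the associativity of annotated alternation.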



3.3 Examples

In this section we illustrate the application of our negentropy formulas with several
simple examples. Several of these examples are based on free languages:

Definition: The free language on an alphabet T is the set of all finite strings
(including the empty string) of elements of T.

Thus $T^*$ is the free language on T. In most cases it does not matter what the alphabet
T is, so we speak of the free language on n symbols. Let $T_n$ represent an alphabet
of n symbols:

$$T_n = \{\tau_1, \tau_2, \ldots, \tau_n\}.$$

Then $F_n$, the free language on n symbols, is defined

$$F_n = (\tau_1 \oplus \tau_2 \oplus \cdots \oplus \tau_n)^*.$$

Of course, before we can compute the entropy of a language we must annotate its
grammar with probabilities. Therefore, the annotated grammar for the free language
on n symbols is

$$F_n = (q_1\tau_1 \oplus q_2\tau_2 \oplus \cdots \oplus q_n\tau_n)^{p*}.$$

Theorem: The negentropy of the free language on n symbols,

$$F_n = (q_1\tau_1 \oplus q_2\tau_2 \oplus \cdots \oplus q_n\tau_n)^{p*},$$

is

$$H(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p}.$$

Proof: We simply apply the formulas from Table 1:

$$\begin{aligned}
H(F_n) &= H[(q_1\tau_1 \oplus \cdots \oplus q_n\tau_n)^{p*}] \\
&= [H(p,\bar{p}) + pH[q_1\tau_1 \oplus \cdots \oplus q_n\tau_n]]/\bar{p} \\
&= [H(p,\bar{p}) + p\{H(q_1, \ldots, q_n) + q_1H(\tau_1) + \cdots + q_nH(\tau_n)\}]/\bar{p} \\
&= [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p}.
\end{aligned}$$

Q.E.D.

The free language on one symbol $\tau$ is just the set of all strings of $\tau$s:

$$L(F_1) = \{\epsilon, \tau, \tau\tau, \tau\tau\tau, \ldots\}.$$

The following theorem defines the negentropy of $F_1$.

Corollary: The negentropy of the free language with continuation probability $p$ on
one symbol is:

$$H(F_1) = H(p,\bar{p})/\bar{p}.$$

Proof: We simply use the previous theorem with $n = 1$.

Corollary: The negentropy of the free language with continuation probability $p$ on an
alphabet of n equally likely symbols is

$$[H(p,\bar{p}) - p \lg n]/\bar{p}.$$

Proof: To derive this simply set $q_i = 1/n$ in the negentropy formula for $F_n$:

$$\begin{aligned}
H(F_n) &= [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p} \\
&= [H(p,\bar{p}) + pH(1/n, \ldots, 1/n)]/\bar{p} \\
&= \Big[H(p,\bar{p}) + p \sum_{i=1}^{n} (1/n) \lg (1/n)\Big]/\bar{p} \\
&= [H(p,\bar{p}) + p \lg (1/n)]/\bar{p} \\
&= [H(p,\bar{p}) - p \lg n]/\bar{p}.
\end{aligned}$$

Q.E.D.



Table 2 shows the entropies of free languages on equally likely symbols for several 
different continuation probabilities. 



TABLE 2. Entropies (in bits) of Free Languages on Equally Likely Symbols

    p \ n       2       4       8      10      12      64     256

    0.1      0.63    0.74    0.85    0.89    0.92    1.19    1.41
    0.2      1.15    1.40    1.65    1.73    1.80    2.40    2.90
    0.3      1.69    2.12    2.54    2.68    2.80    3.83    4.69
    0.4      2.28    2.95    3.62    3.83    4.01    5.62    6.95
    0.5      3.00    4.00    5.00    5.32    5.58    8.00   10.00
    0.6      3.93    5.43    6.93    7.41    7.80   11.43   14.43
    0.7      5.27    7.60    9.94   10.69   11.30   16.94   21.60
    0.8      7.61   11.61   15.61   16.90   17.95   27.61   35.61
    0.9     13.69   22.69   31.69   34.59   36.95   58.69   76.69
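
The entries of Table 2 follow directly from the corollary for n equally likely symbols, with the sign flipped to convert negentropy to entropy. The sketch below recomputes a few of them; the function name is illustrative.

```python
from math import log2

def free_language_entropy(p, n):
    """Entropy in bits of the free language on n equally likely symbols with
    continuation probability p, i.e. -[H(p, 1-p) - p lg n] / (1 - p)."""
    q = 1.0 - p
    h = p * log2(p) + q * log2(q)   # negentropy H(p, 1 - p)
    return -(h - p * log2(n)) / q

for p in (0.1, 0.5, 0.9):
    row = [round(free_language_entropy(p, n), 2) for n in (2, 4, 8, 64, 256)]
    print(p, row)
```

For fixed p the entropy grows linearly in lg n with slope p/(1 - p), which is why the rows for large p spread out so quickly.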



This table suggests that we consider the special case in which n is a power of two and p
is one half. This leads to:

Corollary: The negentropy of the free language with continuation probability one
half and $2^k$ equally likely symbols is $-\lg 2^k - 2 = -(k+2)$. Conversely, the entropy
is $\lg 2^k + 2 = k + 2$.

Proof: Let $p = \bar{p} = 1/2$ and $n = 2^k$ in the formula from the previous corollary and
we have

$$\begin{aligned}
H &= [H(\tfrac12,\tfrac12) - \tfrac12 \lg n]/\tfrac12 \\
&= 2[\tfrac12 \lg \tfrac12 + \tfrac12 \lg \tfrac12 - \tfrac12 \lg n] \\
&= 2 \lg \tfrac12 - \lg n \\
&= -2 - \lg 2^k = -2 - k.
\end{aligned}$$

Q.E.D.



3.4 Computing the Entropy of Context-Free Grammars

In this section we extend the results of the previous sections to the computation of the
entropy of any context-free grammar.

As usual, we define a context-free grammar G to be a quadruple,

$$G = \langle T, N, P, \nu_0 \rangle,$$

in which T is a finite set of terminal symbols, N is a finite set of nonterminal symbols,
$\nu_0 \in N$ is the goal symbol, and P is a finite set of productions,

$$P \subset N \times (T \cup N)^*.$$

That is, each production is a pair of the form $\langle \nu, \alpha \rangle$, in which
$\nu$ is a nonterminal and $\alpha$ is a finite string of terminals and nonterminals. Such
a production is usually written '$\nu \to \alpha$'. The Backus-Naur form (BNF) of a
context-free grammar combines all the productions for a given nonterminal into a single
production. For example, if a context-free grammar contains the following productions
for $\nu$:

$$\nu \to \alpha_1$$

$$\nu \to \alpha_2$$

$$\vdots$$

$$\nu \to \alpha_n$$

then the BNF form of this grammar combines them into a single production:

$$\nu \to \alpha_1 \oplus \alpha_2 \oplus \cdots \oplus \alpha_n.$$

In the following discussion we will usually use the BNF form of grammars.

The characteristic that distinguishes context-free grammars from regular grammars is
that the productions of a context-free grammar can be mutually recursive. That is, a
nonterminal $\nu$ can be defined in terms of a string that is, directly or indirectly,
defined in terms of $\nu$. It is well known, however, (see Ginsburg) that each production
in a BNF grammar can be considered an equation on sets of strings. If we recursively
define $L(\alpha)$, the language defined by $\alpha$, as follows:

$$L(\epsilon) = \{\epsilon\}$$

$$L(\tau) = \{\tau\}$$

$$L(\alpha\beta) = L(\alpha) \cdot L(\beta), \quad \text{where } S \cdot T = \{\alpha\beta \mid \alpha \in S,\ \beta \in T\},$$

$$L(\alpha \oplus \beta) = L(\alpha) \cup L(\beta)$$

then each production $\nu \to \alpha$ of a context-free grammar G can be transformed into a
corresponding equation

$$L(\nu) = L(\alpha).$$

Let $G = \langle T, N, P, \nu_0 \rangle$ be a context-free grammar, in which

$$P = \{\nu_0 \to \alpha_0,\ \nu_1 \to \alpha_1,\ \ldots,\ \nu_k \to \alpha_k\}$$

is a set of productions in BNF form. Then corresponding to P is a collection of
simultaneous equations on sets of strings:

$$L(\nu_0) = L(\alpha_0)$$

$$L(\nu_1) = L(\alpha_1)$$

$$\vdots$$

$$L(\nu_k) = L(\alpha_k)$$

The solution to this set of equations defines the language generated by G, since

$$L(G) = L(\nu_0).$$

Context-free grammars can be annotated with production probabilities in the same
way as regular grammars. Formally, we define an annotated context-free grammar G
to be a quadruple $\langle T, N, P, \nu_0 \rangle$, in which T, N and $\nu_0$ are as
before, and

$$P \subset \mathbb{R} \times N \times (T \cup N)^*.$$

Thus each production is a triple $\langle p, \nu, \alpha \rangle$, p being a real number
representing the probability of applying the production $\nu \to \alpha$. We impose the
restriction that all the probabilities associated with the productions for a given
nonterminal must sum to unity:

$$\sum_{\langle p_i, \nu, \alpha_i \rangle \in P} p_i = 1, \quad \text{for } \nu \in N.$$

This is simpler to see in the BNF form of an annotated context-free grammar. In any
production that is an alternation,

$$\nu \to p_1\alpha_1 \oplus p_2\alpha_2 \oplus \cdots \oplus p_n\alpha_n,$$

we must have that

$$\sum_{i=1}^{n} p_i = 1.$$

Consider an unambiguous annotated context-free grammar G and let $\Sigma = L(G)$ be the
language generated by G. Let $P_G(\sigma)$ be the probability that a string $\sigma$ is
generated by G. We say that $\Sigma$ is predicted by G if for every string $\sigma$,
$P_\Sigma(\sigma) = P_G(\sigma)$, that is, the observed probability of occurrence of
$\sigma$ is the same as the probability of its generation by G. We now consider how we
might compute the negentropy of $\Sigma$ from G.

Consider a production $\nu \to \alpha$ in the annotated grammar; this corresponds to an
equation $L(\nu) = L(\alpha)$. Since $\nu \to \alpha$, the probability of a string being
generated from $\nu$ is the same as its probability of being generated from $\alpha$.
Thus, $P_\nu(\sigma) = P_\alpha(\sigma)$, for all $\sigma \in L(\nu) = L(\alpha)$. Thus,
the negentropy of $L(\nu)$, which we can write $H(\nu)$, is the same as the negentropy of
$L(\alpha)$, which we can write $H(\alpha)$. That is,

$$H(\nu) = H(\alpha).$$

It can be seen that corresponding to the BNF productions of the grammar there is a set
of simultaneous equations

$$H(\nu_0) = H(\alpha_0)$$

$$H(\nu_1) = H(\alpha_1)$$

$$\vdots$$

$$H(\nu_n) = H(\alpha_n)$$

that can be solved to yield the negentropy of the language predicted by the grammar.
In particular,

$$H(\Sigma) = H(G) = H(\nu_0).$$

We have already made use of this technique in applying the recursive definitions of
$A^{p*}$ and $A^{p+}$ to solve for their negentropy. In summary, the methods developed
previously for computing the negentropies of regular languages can be extended in the
obvious way to context-free languages.
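
To make the simultaneous-equation method concrete, the sketch below solves the negentropy equations for a small annotated context-free grammar. Since catenation adds negentropies and alternation weights them, each equation is linear in the unknowns $H(\nu_i)$; here they are solved by simple fixed-point iteration, which is a choice made for this sketch rather than a method given in the text. The bracket grammar $S \to \bar{p}\epsilon \oplus p\,(\,S\,)\,S$ and the value $p = 0.3$ are hypothetical, and the example assumes $p < 1/2$ so that the iteration converges.

```python
from math import log2

def h(*probs):
    """Negentropy H(p1, ..., pn) = sum p_i lg p_i."""
    return sum(p * log2(p) for p in probs if p > 0)

def linearize(alternatives, nonterminals):
    """Return (constant, coeffs) such that H(nu) = constant + sum coeffs[v] * H(v)."""
    constant = h(*(p for p, _ in alternatives))
    coeffs = {v: 0.0 for v in nonterminals}
    for p, symbols in alternatives:
        for s in symbols:
            if s in nonterminals:      # terminals contribute negentropy 0
                coeffs[s] += p
    return constant, coeffs

def solve_negentropies(grammar, iterations=10_000):
    """Solve the simultaneous negentropy equations by fixed-point iteration."""
    eqs = {v: linearize(alts, grammar) for v, alts in grammar.items()}
    x = {v: 0.0 for v in grammar}
    for _ in range(iterations):
        x = {v: c + sum(m[u] * x[u] for u in grammar) for v, (c, m) in eqs.items()}
    return x

# Grammar in BNF form: nonterminal -> list of (probability, right-hand side).
p = 0.3
grammar = {"S": [(1 - p, []), (p, ["(", "S", ")", "S"])]}
H_S = solve_negentropies(grammar)["S"]
```

For this grammar the single equation is $H(S) = H(p,\bar{p}) + 2pH(S)$, so the iteration should settle at $H(p,\bar{p})/(1-2p)$. A direct linear solve would serve equally well for larger grammars.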

4. Determining the Production Probabilities
4.1 Measurable Properties

To compute any specific entropies we need to know the probabilities of applying the
productions in the appropriate grammar for the language. These can be obtained by
determining measurable parameters whose values are implied by the production
probabilities. That is, the measurable properties are a function of the production
probabilities. The production probabilities can then be determined by (analytically or
numerically) inverting this function.

What measurable properties should we use? One of the simplest is the density of
occurrence of a token. Let $\mathrm{Occ}_\tau(\sigma)$ be the number of occurrences of the
symbol $\tau$ in the string $\sigma$. This is formally defined:

$$\mathrm{Occ}_\tau(\epsilon) = 0,$$

$$\mathrm{Occ}_\tau(\tau) = 1,$$

$$\mathrm{Occ}_\tau(\tau') = 0, \quad \text{for } \tau \ne \tau',$$

$$\mathrm{Occ}_\tau(\sigma) = \mathrm{Occ}_\tau(\alpha) + \mathrm{Occ}_\tau(\beta), \quad \text{for } \sigma = \alpha\beta.$$

Then $\Delta_\tau(G)$, the density of occurrence of $\tau$ in the language generated by G,
is

$$\Delta_\tau(G) = \frac{\sum_{\sigma \in L(G)} P_G(\sigma)\, \mathrm{Occ}_\tau(\sigma)}{\sum_{\sigma \in L(G)} P_G(\sigma)\, |\sigma|},$$

where $P_G(\sigma)$ is the predicted probability of generation of $\sigma$ and $|\sigma|$
is the length of $\sigma$. If G predicts $L(G)$, then $\Delta_\tau(G)$ will be the observed
density of occurrence of $\tau$ in the language generated by G.

The formula for $\Delta_\tau(G)$ suggests two useful properties of a grammar: the average
length of the strings it generates and the average number of occurrences of a token in
a string. We let $\Lambda(G)$ be the predicted average length of the strings generated by
G:

$$\Lambda(G) = \sum_{\sigma \in L(G)} P_G(\sigma)\, |\sigma|.$$

We let $\Phi_\tau(G)$ be the predicted frequency of occurrence of $\tau$ in the strings
generated by G:

$$\Phi_\tau(G) = \sum_{\sigma \in L(G)} P_G(\sigma)\, \mathrm{Occ}_\tau(\sigma).$$

It then follows that

$$\Delta_\tau(G) = \Phi_\tau(G) / \Lambda(G).$$

The goal then is to find ways to compute $\Phi_\tau(G)$ and $\Lambda(G)$ from G. This will
permit us to calculate predicted values of $\Delta_\tau$ which can be compared with actual
measurements.
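
For a language small enough to enumerate, these three quantities can be computed directly from their definitions. The sketch below does so for a hypothetical three-string language represented as a mapping from token tuples to probabilities; the representation and the numbers are assumptions for illustration.

```python
def average_length(lang):
    """Lambda: expected string length; lang maps token tuples to probabilities."""
    return sum(prob * len(s) for s, prob in lang.items())

def frequency(lang, token):
    """Phi_tau: expected number of occurrences of `token` per string."""
    return sum(prob * s.count(token) for s, prob in lang.items())

def density(lang, token):
    """Delta_tau = Phi_tau / Lambda: expected share of tokens equal to `token`."""
    return frequency(lang, token) / average_length(lang)

# Hypothetical finite language, for illustration only.
lang = {("a",): 0.5, ("a", "b"): 0.25, ("b", "b", "a"): 0.25}
delta_a = density(lang, "a")
delta_b = density(lang, "b")
```

Since every token in this language is either `a` or `b`, the two densities sum to 1, which gives a quick consistency check on the definitions.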
Therefore, the next two sections present means for calculating $\Lambda(G)$ and
$\Phi_\tau(G)$.

4.2 Average String Length

We begin again with regular grammars.

Theorem: The average lengths of the empty and single token grammars are defined:

$$\Lambda(\epsilon) = 0 \text{ tokens},$$

$$\Lambda(\tau) = 1 \text{ token}.$$

Proof: Obvious.

For the remaining derivations we need some notation. Suppose that

$$L(A) = \{\alpha_1, \alpha_2, \ldots\}, \quad a_i = P_A(\alpha_i),$$

$$L(B) = \{\beta_1, \beta_2, \ldots\}, \quad b_j = P_B(\beta_j).$$

Then it follows that

$$\Lambda(A) = \sum_i a_i |\alpha_i|,$$

$$\Lambda(B) = \sum_j b_j |\beta_j|.$$

Theorem: The average length of the catenation of two grammars is the sum of their
average lengths:

$$\Lambda(AB) = \Lambda(A) + \Lambda(B).$$

Proof: Note that if $\sigma \in L(AB)$ then $\sigma = \alpha_i\beta_j$ for some
$\alpha_i \in L(A)$, $\beta_j \in L(B)$. Assuming as usual that the choices from A and B
are independent,

$$P_{AB}(\sigma) = P_A(\alpha_i) P_B(\beta_j) = a_i b_j.$$

We now derive the average length:

$$\begin{aligned}
\Lambda(AB) &= \sum_i \sum_j P_{AB}(\alpha_i\beta_j)\, |\alpha_i\beta_j| \\
&= \sum_i \sum_j a_i b_j (|\alpha_i| + |\beta_j|) \\
&= \sum_i a_i \Big[ \sum_j b_j |\alpha_i| + \sum_j b_j |\beta_j| \Big] \\
&= \sum_i a_i [\,|\alpha_i| + \Lambda(B)] \\
&= \sum_i a_i |\alpha_i| + \sum_i a_i \Lambda(B) \\
&= \Lambda(A) + \Lambda(B).
\end{aligned}$$

Q.E.D.

CorcncL-7: A(.4 n ) = nA(.-i) 

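Theorem: The average length of an alternation is the weighted average of the average
lengths of the alternands:

$$\Lambda(pA \oplus \bar{p}B) = p\Lambda(A) + \bar{p}\Lambda(B).$$

Proof: Let $G = pA \oplus \bar{p}B$. Recall that if $\sigma \in L(G)$ then either $\sigma \in L(A)$ or $\sigma \in L(B)$, and that
the choice from $A$ is made with probability $p$. Therefore

$$P_G(\sigma) = p\,P_A(\sigma), \quad \text{if } \sigma \in L(A),$$

$$P_G(\sigma) = \bar{p}\,P_B(\sigma), \quad \text{if } \sigma \in L(B).$$

Then we derive:

$$
\begin{aligned}
\Lambda(G) &= \sum_{\sigma \in L(G)} P_G(\sigma)\,|\sigma| \\
&= \sum_i P_G(\alpha_i)\,|\alpha_i| + \sum_j P_G(\beta_j)\,|\beta_j| \\
&= \sum_i p\,a_i\,|\alpha_i| + \sum_j \bar{p}\,b_j\,|\beta_j| \\
&= p \sum_i a_i |\alpha_i| + \bar{p} \sum_j b_j |\beta_j| \\
&= p\Lambda(A) + \bar{p}\Lambda(B).
\end{aligned}
$$

Q.E.D.

Theorem: $\Lambda(A^{p+}) = \Lambda(A)/\bar{p}$.

Proof: We appeal to the infinite expansion:

$$A^{p+} = \bar{p}A \oplus p\bar{p}A^2 \oplus p^2\bar{p}A^3 \oplus \cdots.$$

Applying $\Lambda$ to both sides:

$$
\begin{aligned}
\Lambda(A^{p+}) &= \bar{p}\Lambda(A) + p\bar{p}\Lambda(A^2) + p^2\bar{p}\Lambda(A^3) + \cdots \\
&= \bar{p}\,[1 + 2p + 3p^2 + \cdots]\,\Lambda(A) \\
&= \bar{p}\,[1/\bar{p}^2]\,\Lambda(A) \\
&= \Lambda(A)/\bar{p}.
\end{aligned}
$$

Q.E.D. Alternately, we can appeal to the recursive definition:

$$
\begin{aligned}
\Lambda(A^{p+}) &= \bar{p}\Lambda(A) + p\Lambda(AA^{p+}) \\
&= \bar{p}\Lambda(A) + p\Lambda(A) + p\Lambda(A^{p+}).
\end{aligned}
$$

Grouping like terms gives

$$(1-p)\,\Lambda(A^{p+}) = (\bar{p}+p)\,\Lambda(A),$$

which leads directly to the result. Q.E.D.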

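Theorem: $\Lambda(A^{p*}) = (p/\bar{p})\Lambda(A)$.

Proof: We apply $\Lambda$ to the infinite expansion of $A^{p*}$:

$$
\begin{aligned}
\Lambda(A^{p*}) &= \bar{p}\Lambda(\varepsilon) + p\bar{p}\Lambda(A) + p^2\bar{p}\Lambda(A^2) + p^3\bar{p}\Lambda(A^3) + \cdots \\
&= \bar{p}\cdot 0 + p\bar{p}\Lambda(A) + p^2\bar{p}\cdot 2\Lambda(A) + p^3\bar{p}\cdot 3\Lambda(A) + \cdots \\
&= p\bar{p}\,(1 + 2p + 3p^2 + \cdots)\,\Lambda(A) \\
&= p\bar{p}\,(1/\bar{p}^2)\,\Lambda(A) \\
&= (p/\bar{p})\Lambda(A).
\end{aligned}
$$

Q.E.D.

Alternately, we can apply $\Lambda$ to the recursive definition:

$$
\begin{aligned}
\Lambda(A^{p*}) &= \bar{p}\Lambda(\varepsilon) + p\Lambda(AA^{p*}) \\
&= \bar{p}\cdot 0 + p\Lambda(A) + p\Lambda(A^{p*}).
\end{aligned}
$$

Solving for $\Lambda(A^{p*})$ we get

$$\Lambda(A^{p*}) = (p/\bar{p})\Lambda(A).$$

Q.E.D.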

We consider some simple examples based on free languages.

Theorem: Consider the free language on $n$ symbols generated by:

$$F_n = (q_1\tau_1 \oplus q_2\tau_2 \oplus \cdots \oplus q_n\tau_n)^{p*}.$$

The average length of the strings of this language is

$$\lambda = \Lambda(F_n) = p/\bar{p} \text{ tokens}.$$

Proof: Since $\Lambda(\tau_i) = 1$ and $q_1 + \cdots + q_n = 1$,

$$
\begin{aligned}
\Lambda(F_n) &= (p/\bar{p})\,[q_1\Lambda(\tau_1) + \cdots + q_n\Lambda(\tau_n)] \\
&= (p/\bar{p})\,[q_1 + \cdots + q_n] \\
&= p/\bar{p}.
\end{aligned}
$$

Q.E.D.

Notice that the average length of a free language is independent of the number of
symbols in the alphabet. This is to be expected.

Corollary: The average length of a free language with continuation probability $\tfrac12$ is 1
token.

Proof: Apply the previous theorem with $p = \bar{p} = \tfrac12$. Q.E.D.
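Because each rule above expresses the average length of a grammar in terms of the average lengths of its parts, the rules can be evaluated by one recursive pass over a regular grammar. The following Python sketch shows one way this might be done; the class names (`Empty`, `Token`, `Cat`, `Alt`, `Plus`, `Star`) and the function `avg_length` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empty:
    """The empty grammar (generates only the empty string)."""


@dataclass(frozen=True)
class Token:
    name: str


@dataclass(frozen=True)
class Cat:
    left: object
    right: object


@dataclass(frozen=True)
class Alt:
    p: float        # probability of choosing `left`; `right` gets 1 - p
    left: object
    right: object


@dataclass(frozen=True)
class Plus:
    p: float        # continuation probability, one or more repetitions
    body: object


@dataclass(frozen=True)
class Star:
    p: float        # continuation probability, zero or more repetitions
    body: object


def avg_length(g):
    """Average string length of a regular grammar, one rule per node type."""
    if isinstance(g, Empty):
        return 0.0
    if isinstance(g, Token):
        return 1.0
    if isinstance(g, Cat):
        return avg_length(g.left) + avg_length(g.right)
    if isinstance(g, Alt):
        return g.p * avg_length(g.left) + (1 - g.p) * avg_length(g.right)
    if isinstance(g, Plus):
        return avg_length(g.body) / (1 - g.p)
    if isinstance(g, Star):
        return g.p / (1 - g.p) * avg_length(g.body)
    raise TypeError(f"unknown grammar node: {g!r}")


# Free language on two equally likely symbols, continuation probability 1/2.
F2 = Star(0.5, Alt(0.5, Token("0"), Token("1")))
print(avg_length(F2))
```

An alternation of more than two alternands can be expressed in this sketch by nesting `Alt` nodes with suitably renormalized probabilities.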

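TABLE 3. Average String Length for Regular Grammars

$\Lambda(\varepsilon) = 0$ tokens

$\Lambda(\tau) = 1$ token

$\Lambda(AB) = \Lambda(A) + \Lambda(B)$

$\Lambda(pA \oplus \bar{p}B) = p\Lambda(A) + \bar{p}\Lambda(B)$

$\Lambda(A^{p+}) = \Lambda(A)/\bar{p}$

$\Lambda(A^{p*}) = (p/\bar{p})\Lambda(A)$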

The formulas for computing average string length are summarized in Table 3.

4.3 Average Token Frequency

The formulas for computing average token frequency are almost identical to those for
average length. For this reason the proofs are omitted and the results are shown in
Table 4.

TABLE 4. Average Token Frequency for Regular Grammars

$\Phi_\tau(\varepsilon) = 0$

$\Phi_\tau(\tau') = 1$ for $\tau = \tau'$, and $0$ for $\tau \neq \tau'$

$\Phi_\tau(AB) = \Phi_\tau(A) + \Phi_\tau(B)$

$\Phi_\tau(pA \oplus \bar{p}B) = p\Phi_\tau(A) + \bar{p}\Phi_\tau(B)$

$\Phi_\tau(A^{p+}) = \Phi_\tau(A)/\bar{p}$

$\Phi_\tau(A^{p*}) = (p/\bar{p})\Phi_\tau(A)$

Theorem: Consider the free language on $n$ symbols:

$$F_n = (q_1\tau_1 \oplus q_2\tau_2 \oplus \cdots \oplus q_n\tau_n)^{p*}.$$

The frequency of occurrence of token $\tau_i$ is

$$\varphi_i = \Phi_{\tau_i}(F_n) = q_i\,p/\bar{p}.$$

Proof: We derive as follows, abbreviating $\Phi_{\tau_i}$ by $\Phi_i$:

$$
\begin{aligned}
\Phi_i(F_n) &= (p/\bar{p})\,\Phi_i[q_1\tau_1 \oplus \cdots \oplus q_n\tau_n] \\
&= (p/\bar{p})\,[q_1\Phi_i(\tau_1) + \cdots + q_i\Phi_i(\tau_i) + \cdots + q_n\Phi_i(\tau_n)] \\
&= (p/\bar{p})\,[q_1 \cdot 0 + \cdots + q_i \cdot 1 + \cdots + q_n \cdot 0] \\
&= (p/\bar{p})\,q_i.
\end{aligned}
$$

Q.E.D.

Corollary: In the free language of the previous theorem, the density of occurrence of
symbol $\tau_i$ is $q_i$. We denote this measurable property $\delta_i$.

Proof: Since $\Delta_i(F_n) = \Phi_i(F_n)/\Lambda(F_n)$, we have

$$\delta_i = \varphi_i/\lambda = \frac{q_i\,p/\bar{p}}{p/\bar{p}} = q_i.$$

Q.E.D.



The following corollary shows that for the case of a free language it is easy to
compute the production probabilities from measurable properties of strings in the
language.

Corollary: If we measure the properties $\delta_1, \delta_2, \ldots, \delta_n$ and $\lambda$ of a free language on $n$
symbols, then we can compute the probabilities $q_1, q_2, \ldots, q_n$ and $p$ from them by the
formulas:

$$q_i = \delta_i,$$

$$p = \lambda/(\lambda+1).$$

Proof: The formulas for $q_i$ are obvious. For $p$ we know that

$$\lambda = p/\bar{p} = p/(1-p).$$

Therefore $\lambda - p\lambda = p$, so $\lambda = p + p\lambda$. Thus $\lambda = p(\lambda+1)$, so $p = \lambda/(\lambda+1)$. Q.E.D.
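The corollary above suggests a direct estimation procedure: measure the average length and the token densities in a sample of strings, then read off $p$ and the $q_i$. A minimal Python sketch follows; the function name `estimate_free_language` and the sample data are hypothetical, and tokens are assumed to be single characters.

```python
from collections import Counter


def estimate_free_language(samples):
    """Estimate (p, q) of a free language from a list of sample strings.

    Uses q_i = delta_i and p = lambda / (lambda + 1). Assumes the sample
    contains at least one token, otherwise the densities are undefined.
    """
    total_tokens = sum(len(s) for s in samples)
    if total_tokens == 0:
        raise ValueError("sample contains no tokens")
    lam = total_tokens / len(samples)               # measured average length
    counts = Counter(tok for s in samples for tok in s)
    q = {tok: c / total_tokens for tok, c in counts.items()}  # densities
    p = lam / (lam + 1)
    return p, q


# Hypothetical sample: 8 strings, 8 tokens in total.
samples = ["", "a", "ab", "b", "", "aab", "", "a"]
p, q = estimate_free_language(samples)
print(p, q)
```

For this sample the measured average length is 1 token, so the estimated continuation probability is one half, in agreement with the earlier corollary on free languages of average length 1.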



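Corollary: The negentropy of a free language exhibiting occurrence densities $\delta_1, \ldots, \delta_n$
and average string length $\lambda$ is:

$$H_n = H(\lambda, \lambda+1) + \lambda H(\delta_1, \ldots, \delta_n).$$

Proof: To derive this result we take the formula for the negentropy of a free language,

$$H_n = H(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p},$$

and substitute the values for $q_i$ and $p$ derived in the previous corollary. To do this,
note that

$$\bar{p} = 1 - p = 1 - \frac{\lambda}{\lambda+1} = \frac{1}{\lambda+1}.$$

Then we have

$$
\begin{aligned}
H_n &= (\lambda+1)\left[ H\!\left(\frac{\lambda}{\lambda+1}, \frac{1}{\lambda+1}\right) + \frac{\lambda}{\lambda+1} H(\delta_1, \ldots, \delta_n) \right] \\
&= (\lambda+1)\left[ \frac{\lambda}{\lambda+1} \lg\frac{\lambda}{\lambda+1} + \frac{1}{\lambda+1} \lg\frac{1}{\lambda+1} \right] + \lambda H(\delta_1, \ldots, \delta_n) \\
&= \lambda \lg\frac{\lambda}{\lambda+1} - \lg(\lambda+1) + \lambda H(\delta_1, \ldots, \delta_n) \\
&= \lambda \lg\lambda - \lambda\lg(\lambda+1) - \lg(\lambda+1) + \lambda H(\delta_1, \ldots, \delta_n) \\
&= \lambda \lg\lambda - (\lambda+1)\lg(\lambda+1) + \lambda H(\delta_1, \ldots, \delta_n) \\
&= H(\lambda, \lambda+1) + \lambda H(\delta_1, \ldots, \delta_n),
\end{aligned}
$$

where the last line writes $H(\lambda, \lambda+1)$ for $\lambda\lg\lambda - (\lambda+1)\lg(\lambda+1)$.

Q.E.D.

Thus we have the negentropy (and hence entropy) of a language expressed entirely
in measurable parameters.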
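The equivalence of the two forms can be checked numerically. The sketch below, with hypothetical function names, works with Shannon entropy in bits (the negation of the negentropy used above), so the measured form becomes $(\lambda+1)\lg(\lambda+1) - \lambda\lg\lambda + \lambda H(\delta_1, \ldots, \delta_n)$ with $H$ taken as entropy.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * log2(x) for x in probs if x > 0)


def free_entropy_from_probabilities(p, q):
    """Entropy of a free language from its production probabilities."""
    return (entropy([p, 1 - p]) + p * entropy(q)) / (1 - p)


def free_entropy_from_measurements(lam, deltas):
    """Entropy of a free language from average length and token densities."""
    head = (lam + 1) * log2(lam + 1) - (lam * log2(lam) if lam > 0 else 0.0)
    return head + lam * entropy(deltas)


# Continuation probability 3/4 corresponds to an average length of 3 tokens.
p, q = 0.75, [0.25, 0.75]
lam = p / (1 - p)
print(free_entropy_from_probabilities(p, q))
print(free_entropy_from_measurements(lam, q))
```

Both calls should print the same value up to rounding, since $q_i = \delta_i$ and $p = \lambda/(\lambda+1)$.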

4.4 Average Information per Symbol

Recall that the entropy of a language measures the average information borne by each
string in the language. That is,

$$H(L) = \sum_{\sigma \in L} P_L(\sigma)\,I_L(\sigma).$$

However, each of the strings of the language is composed of a number of terminal
symbols (tokens). Therefore, it is interesting to compute the average information borne by
each symbol (token) in the strings in the language. We call this the information
density of the language.

Information density is easy to compute: it is simply the average information borne
by the strings of the language divided by the average length of those strings:

$$\eta(L) = H(L)/\Lambda(L),$$

where we have used $\eta(L)$ for the average information borne per symbol in the strings of $L$.
The units of information density are bits/token.

If the grammar $G$ predicts the language $L$, then

$$\eta(L) = \eta(G) = H(G)/\Lambda(G).$$

We use this result to compute the information density for several languages.

Theorem: Let $F_n$ be the free language with continuation probability $p$ on $n$ symbols
with probabilities $q_i$. Then, the information density of $F_n$ is:

$$\eta(F_n) = H(p,\bar{p})/p + H(q_1, \ldots, q_n) \text{ bits/token}.$$

Proof: Take the formula for the negentropy of $F_n$ and negate it to get the entropy
formula:

$$H(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p}.$$

Divide this by the average length $\Lambda(F_n) = p/\bar{p}$ to get the information density:

$$
\begin{aligned}
\eta(F_n) &= H(F_n)/\Lambda(F_n) \\
&= \frac{[H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p}}{p/\bar{p}} \\
&= H(p,\bar{p})/p + H(q_1, \ldots, q_n).
\end{aligned}
$$

Q.E.D.

Corollary: The information density of the free language on one symbol with
continuation probability $p$ is:

$$\eta(F_1) = H(p,\bar{p})/p \text{ bits/token}.$$

Corollary: The information density of the free language with continuation probability
$p$ on $n$ equally likely symbols is:

$$\eta = H(p,\bar{p})/p + \lg n \text{ bits/token}.$$

Proof: We simply use $q_i = 1/n$:

$$
\begin{aligned}
\eta &= H(p,\bar{p})/p + H(q_1, \ldots, q_n) \\
&= H(p,\bar{p})/p + H(1/n, \ldots, 1/n) \\
&= H(p,\bar{p})/p + (1/n)\lg n + \cdots + (1/n)\lg n \\
&= H(p,\bar{p})/p + \lg n \text{ bits/token}.
\end{aligned}
$$

Q.E.D.

Corollary: The information density of the free language with continuation probability
one half and $2^k$ equally likely symbols is $k+2$ bits/token.

Proof: Let $n = 2^k$ and $p = \bar{p} = \tfrac12$ in the previous formula:

$$
\begin{aligned}
\eta &= H(\tfrac12,\tfrac12)/(\tfrac12) + \lg 2^k \\
&= 2H(\tfrac12,\tfrac12) + k \\
&= 2(\tfrac12 \lg 2 + \tfrac12 \lg 2) + k \\
&= 2(\tfrac12 + \tfrac12) + k \\
&= k + 2 \text{ bits/token}.
\end{aligned}
$$

Q.E.D.
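The information density of a free language is simple enough to evaluate directly. The following sketch, using the hypothetical name `free_language_density`, computes $H(p,\bar{p})/p + H(q_1, \ldots, q_n)$ with $H$ taken as Shannon entropy in bits, and checks the case of $2^k$ equally likely symbols with continuation probability one half.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * log2(x) for x in probs if x > 0)


def free_language_density(p, q):
    """Information density (bits/token) of a free language.

    `p` is the continuation probability (0 < p < 1) and `q` the list of
    symbol probabilities, assumed to sum to 1.
    """
    return entropy([p, 1 - p]) / p + entropy(q)


# k = 3: eight equally likely symbols, continuation probability 1/2.
k = 3
print(free_language_density(0.5, [1 / 2**k] * 2**k))
```

With `k = 3` the result should be `k + 2 = 5` bits/token, and with a single symbol the second term vanishes.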

The difference between the entropy of a language and its average information density
can be understood by looking at some simple examples. In particular we will consider
the languages $N_k$ of all nonempty strings on $k$ symbols. Thus, $N_k$ is just the free
language $F_k$ without the empty string. Conversely,

$$F_k = N_k \oplus \varepsilon.$$

Theorem: Let $N_k = (q_1\tau_1 \oplus \cdots \oplus q_k\tau_k)^{p+}$. Then,

$$H(N_k) = [H(p,\bar{p}) + H(q_1, \ldots, q_k)]/\bar{p} \text{ bits}$$

$$\Lambda(N_k) = 1/\bar{p} \text{ tokens}$$

$$\eta(N_k) = H(p,\bar{p}) + H(q_1, \ldots, q_k) \text{ bits/token}.$$

Proof: Simply apply the previous formulas. Q.E.D.

To understand the implications of this result we consider an especially simple case,
$N_1$, the language of all nonempty strings on one symbol $\tau$:

$$N_1 = \tau^{p+}.$$

Hence $L(N_1) = \{\tau, \tau\tau, \tau\tau\tau, \ldots\}$.

Corollary: If the continuation probability of $N_1$ is one half, then

$$H(N_1) = 2 \text{ bits}$$

$$\Lambda(N_1) = 2 \text{ tokens}$$

$$\eta(N_1) = 1 \text{ bit/token}.$$

Proof: Substitute $p = \tfrac12$ in the previous formulas and recall that

$$\tfrac12 \lg 2 + \tfrac12 \lg 2 = \lg 2 = 1 \text{ bit}.$$

Q.E.D.

This result is easy to interpret, since each succeeding token indicates that the 
choice has been made to continue the string. Since the probability of the choice is one 
half, each token conveys one bit of information. 
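The values for $N_1$ can also be confirmed by summing directly over the strings of the language, since the string of $k$ tokens has probability $\bar{p}\,p^{k-1}$. The sketch below does this with a truncated series; the function name `n1_statistics` and the truncation length are assumptions made for the illustration.

```python
from math import log2


def n1_statistics(p, max_len=200):
    """Entropy (bits), average length (tokens) and density of N_1.

    Sums over strings of 1..max_len tokens; the neglected tail is tiny
    for moderate p, so this is an approximation, not an exact result.
    """
    h = 0.0
    lam = 0.0
    for k in range(1, max_len + 1):
        prob = (1 - p) * p ** (k - 1)   # probability of the k-token string
        if prob == 0.0:                 # guard against floating-point underflow
            break
        h -= prob * log2(prob)
        lam += prob * k
    return h, lam, h / lam


print(n1_statistics(0.5))
```

For $p = \tfrac12$ the three values should be very close to 2 bits, 2 tokens and 1 bit/token.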

Next we consider $N_2$, which can be considered the language of nonnull strings of
binary digits:

$$N_2 = (0 \oplus 1)^{p+}.$$

The following theorem addresses the information density of this language.

Theorem: If $N_2$ is the language of nonnull binary strings:

$$N_2 = (q0 \oplus \bar{q}1)^{p+},$$

then

$$H(N_2) = [H(p,\bar{p}) + H(q,\bar{q})]/\bar{p} \text{ bits}$$

$$\Lambda(N_2) = 1/\bar{p} \text{ tokens}$$

$$\eta(N_2) = H(p,\bar{p}) + H(q,\bar{q}) \text{ bits/token}.$$

Proof: Apply the previous formulas. Q.E.D.

Corollary: Suppose the binary digits 0 and 1 are equally likely. Then the information
density of the language of nonnull binary strings is

$$\eta(N_2) = 1 + H(p,\bar{p}) \text{ bits/token}.$$

Proof: Apply the previous theorem, with $q = \bar{q} = \tfrac12$. Q.E.D.

Note that since $H(p,\bar{p}) > 0$, we know

$$\eta(N_2) > 1 \text{ bit/token}.$$

Since in this case a token is a binary digit, we have the somewhat surprising result that
the information density of the language of binary strings is greater than one bit per
binary digit. How can this be? The extra $H(p,\bar{p})$ bits of information per binary digit
come from the fact that the binary strings are variable length. We previously saw that
in $N_1$ the continuation of the string conveys $H(p,\bar{p})$ bits of information.

The source of the extra information can be made clearer by considering a language
in which it is absent, the language of all $n$-digit binary strings:

$$W_n = (q0 \oplus \bar{q}1)^n.$$

Let $q = \bar{q} = \tfrac12$. Then

$$H(W_n) = nH(\tfrac12,\tfrac12) = n \text{ bits}$$

$$\Lambda(W_n) = n\Lambda(q0 \oplus \bar{q}1) = n \text{ tokens}$$

$$\eta(W_n) = n/n = 1 \text{ bit/token}.$$

Thus the information density of $W_n$ is one bit per binary digit, as expected. Since all
the strings of $W_n$ are the same length, no information is conveyed by the continuation
of the string.

Consider again the language of nonempty base $k$ strings:

$$N_k = (q_1\tau_1 \oplus \cdots \oplus q_k\tau_k)^{p+}.$$

We have seen that its information density is

$$\eta(N_k) = H(p,\bar{p}) + H(q_1, \ldots, q_k) \text{ bits/token}.$$

Now we can see that this result is intuitive, since each additional symbol in a string
conveys two pieces of information: the decision to continue, $H(p,\bar{p})$ bits, and the
symbol chosen for the continuation, $H(q_1, \ldots, q_k)$ bits.
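The contrast between fixed-length and variable-length binary strings discussed above can be reproduced numerically. In the following sketch (all function names are hypothetical), the fixed-length language is handled by enumerating every $n$-digit string, while the variable-length language uses the closed form from the theorem.

```python
from itertools import product
from math import log2


def binary_entropy(p):
    """Shannon entropy in bits of a two-way choice with probability p."""
    return -sum(x * log2(x) for x in (p, 1 - p) if x > 0)


def fixed_length_stats(n, q):
    """Entropy, length and density of all n-digit binary strings, by enumeration."""
    h = 0.0
    for digits in product("01", repeat=n):
        zeros = digits.count("0")
        prob = q**zeros * (1 - q) ** (n - zeros)
        if prob > 0:
            h -= prob * log2(prob)
    return h, n, h / n


def variable_length_density(p, q):
    """Density of nonnull binary strings: H(p, 1-p) + H(q, 1-q)."""
    return binary_entropy(p) + binary_entropy(q)


print(fixed_length_stats(4, 0.5))          # fixed length: 1 bit/token
print(variable_length_density(0.5, 0.5))   # variable length: more than 1 bit/token
```

The enumeration is exponential in `n` and is meant only as a check on small cases.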

5. Information Theoretic Properties of Grammars

In this section we consider two information theoretic quantities that are properties 
of grammars, as opposed to properties of the languages predicted by those grammars. 
These properties are the average length of a derivation from a grammar and the aver- 
age amount of information consumed by a production in a grammar. 

5.1 Average Derivation Length

First we consider the average length of a derivation from a context-free grammar $G$,
that is, the average number of productions that must be applied in going from the goal
symbol $\nu_0$ to a terminal string $\sigma$. We apply similar techniques to those previously
introduced, transforming the set of productions into an equivalent set of simultaneous
equations.

We will let $D(\alpha)$ represent the average length of a derivation from a string $\alpha$ of
terminals and nonterminals. Now consider an arbitrary BNF production

$$\nu \to p_1\alpha_1 \oplus \cdots \oplus p_n\alpha_n$$

in $P$, the set of productions in $G$. We want to compute $D(\nu)$, the average length of a
derivation from $\nu$. In deriving a terminal string from a string containing $\nu$, we apply
the production $\nu \to \alpha_1$ with probability $p_1$. If we apply $\nu \to \alpha_1$ then the length of the
derivation is one plus the length of the derivation from $\alpha_1$. The same holds for each
of the other alternands. Thus the average length of a derivation from $\nu$ is

$$
\begin{aligned}
D(\nu) &= p_1[1 + D(\alpha_1)] + \cdots + p_n[1 + D(\alpha_n)] \\
&= (p_1 + \cdots + p_n) + p_1D(\alpha_1) + \cdots + p_nD(\alpha_n) \\
&= 1 + p_1D(\alpha_1) + \cdots + p_nD(\alpha_n).
\end{aligned}
$$

This result is intuitive: the constant 1 accounts for the fact that we must apply some
production to eliminate the $\nu$; the remainder of the terms are the weighted average of
the average derivation lengths of the alternands.

Next we derive a number of rules for simplifying the right-hand sides of these
equations. If $\alpha$ is the empty string, then no productions can be applied to it, so

$$D(\varepsilon) = 0.$$

If $\alpha$ begins with a terminal symbol $\tau$, $\alpha = \tau\beta$, then, since no productions can be applied
to a terminal, we have

$$D(\tau\beta) = D(\beta).$$

If $\alpha$ begins with a nonterminal symbol $\mu$, $\alpha = \mu\beta$, then, since both $\mu$ and $\beta$ must be
reduced to terminal strings, the average length of a derivation from $\alpha$ must be the sum
of the average lengths of derivations from $\mu$ and $\beta$:

$$D(\mu\beta) = D(\mu) + D(\beta).$$

We summarize the formulas for computing average derivation length in Table 5.

TABLE 5. Average Derivation Length for Grammars

$\nu \to p_1\alpha_1 \oplus \cdots \oplus p_n\alpha_n$ gives $D(\nu) = 1 + p_1D(\alpha_1) + \cdots + p_nD(\alpha_n)$

$D(\varepsilon) = 0$

$D(\tau) = 0$

$D(\alpha\beta) = D(\alpha) + D(\beta)$

Theorem: Let $F_n$ be the following annotated grammar for the free language on $n$
symbols:

$$F_n \to \bar{p}\varepsilon \oplus pAF_n$$

$$A \to q_1\tau_1 \oplus \cdots \oplus q_n\tau_n.$$

Then the average derivation length of $F_n$ is

$$D(F_n) = (p+1)/\bar{p} \text{ productions}.$$

Proof: We transform the productions into the equations:

$$D(F_n) = 1 + \bar{p}D(\varepsilon) + pD(AF_n)$$

$$D(A) = 1 + q_1D(\tau_1) + \cdots + q_nD(\tau_n).$$

Thus, $D(A) = 1$. Simplifying we derive

$$
\begin{aligned}
D(F_n) &= 1 + p[D(A) + D(F_n)] \\
&= 1 + p + pD(F_n).
\end{aligned}
$$

Solving for $D(F_n)$ we have

$$D(F_n) = (p+1)/\bar{p} \text{ productions}.$$

As would be expected, the average derivation length is independent of the
probabilities $q_i$.

Corollary: The average derivation length of $F_n$ with continuation probability one half
is 3 productions.

Proof: Apply the theorem with $p = \bar{p} = \tfrac12$. Q.E.D.
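The simultaneous equations for average derivation length can also be solved mechanically. The sketch below uses simple fixed-point iteration rather than the algebraic solution given above; this is a substitute technique chosen for brevity, and it assumes the grammar has finite average derivation lengths. The grammar encoding (a dictionary from nonterminals to lists of `(probability, right-hand side)` pairs) and the function name are hypothetical.

```python
def avg_derivation_lengths(grammar, iterations=10_000, tol=1e-12):
    """Iterate D(v) = 1 + sum_i p_i * D(alpha_i) to a fixed point.

    Symbols that are not keys of `grammar` are terminals and contribute 0;
    D of a string is the sum of D over its symbols.
    """
    d = {nt: 0.0 for nt in grammar}
    for _ in range(iterations):
        new = {
            nt: 1.0 + sum(prob * sum(d.get(sym, 0.0) for sym in rhs)
                          for prob, rhs in alts)
            for nt, alts in grammar.items()
        }
        if max(abs(new[nt] - d[nt]) for nt in grammar) < tol:
            return new
        d = new
    return d


# Free language on two symbols, continuation probability 1/2.
grammar = {
    "F": [(0.5, []), (0.5, ["A", "F"])],   # F -> (empty) | A F
    "A": [(0.5, ["a"]), (0.5, ["b"])],     # A -> a | b
}
print(avg_derivation_lengths(grammar))
```

For this grammar the iteration should converge to `D(A) = 1` and `D(F)` close to 3, matching the corollary.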

This result is also intuitive. We always must apply at least one production (for $F_n$).
If we choose to stop, with probability one half, we have applied one production.
However, if we choose to continue, with probability one half, then we must apply two more
productions (one for $A$, one for $F_n$), and repeat our choice. Thus we have:

$$
\begin{aligned}
D &= \tfrac12 \cdot 1 + \tfrac12\left[2 + \tfrac12 \cdot 1 + \tfrac12\left(2 + \tfrac12 \cdot 1 + \cdots\right)\right] \\
&= \tfrac12 + 1 + \tfrac14 + \tfrac12 + \tfrac18 + \cdots
\end{aligned}
$$

Regrouping gives

$$
\begin{aligned}
D &= (1 + \tfrac12 + \tfrac14 + \cdots) + (\tfrac12 + \tfrac14 + \tfrac18 + \cdots) \\
&= 2 + 1 = 3 \text{ productions}.
\end{aligned}
$$



5.2 Average Information Used by a Production

Consider a leftmost derivation of a terminal string from a grammar. At each point
in the derivation a nonterminal must be replaced by a string according to the
productions of the grammar. In many of these cases there will be a choice of which of several
alternate productions is to be applied. Such a choice will require information to be
supplied. Considering the average information that must be supplied per applied
production gives us a gauge of the efficiency with which a grammar transforms
information into terminal strings.

Since in an unambiguous context-free grammar there is a unique leftmost derivation
of any string in the language generated by the grammar, we can set up a one-one
correspondence between the leftmost derivations and the strings. Thus, for any
$\sigma \in L(G)$ there is a unique series of productions $\pi_1, \pi_2, \ldots, \pi_k$ that generates $\sigma$. As we
saw before, if production $\pi_i$ has an a priori probability $P(\pi_i)$ of being chosen, then the
probability that $G$ will generate $\sigma$ is

$$P_G(\sigma) = P(\pi_1)P(\pi_2) \cdots P(\pi_k).$$

Hence, the generation of a string by a grammar can be viewed as a series of choices $\pi_1$,
..., $\pi_k$, having probabilities $P(\pi_1), \ldots, P(\pi_k)$ of being made.

Now we look at this formula a different way. Recall [Shannon, Hamming] that when a
previously undetermined situation with a priori probability $p$ is determined, the
information conveyed is $-\lg p$ bits. That is, information is conveyed by making choices.
Thus, the information conveyed by making choice $\pi_i$, with probability $P(\pi_i)$, is

$$I(\pi_i) = -\lg P(\pi_i) \text{ bits}.$$

Therefore, the information conveyed by $\sigma$ is just the total information conveyed by the
choices that lead to $\sigma$:

$$I_G(\sigma) = -\lg P_G(\sigma) = -\lg P(\pi_1) - \cdots - \lg P(\pi_k) = I(\pi_1) + \cdots + I(\pi_k).$$

This information is used by the grammar in going from an undetermined nonterminal
symbol to a completely determined terminal string. That is, this information drives a
decrease in the entropy from $H(G)$ to 0 (since a terminal string has no entropy).

Now recall that

$$
\begin{aligned}
H(G) &= -\sum_{\sigma \in L(G)} P_G(\sigma) \lg P_G(\sigma) \\
&= \sum_{\sigma \in L(G)} P_G(\sigma)\,I_G(\sigma).
\end{aligned}
$$

Thus, the entropy of a language is the average information conveyed by its strings.

A grammar with higher entropy is less constrained, more disordered, than one with
lower entropy, so on the average it takes more information to generate a particular
string from it. A grammar with low entropy is highly constrained, so on the average
little information is needed (or used) in generating a string; there are fewer choices to be
made.
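The relation between the probability of a string and the information conveyed by its derivation, described above, is easy to illustrate. In this sketch the derivation is represented simply as the list of probabilities of the productions applied; the function names and the example derivation are hypothetical.

```python
from math import log2, prod


def string_probability(production_probs):
    """Probability of a string: the product of its production probabilities."""
    return prod(production_probs)


def string_information(production_probs):
    """Information (bits) of a string: the sum of -lg P over its productions."""
    return sum(-log2(p) for p in production_probs)


# Leftmost derivation of "ab" in the free-language grammar with p = q = 1/2:
#   F -> A F,  A -> a,  F -> A F,  A -> b,  F -> (empty)
derivation = [0.5, 0.5, 0.5, 0.5, 0.5]
print(string_probability(derivation), string_information(derivation))
```

Five binary choices, each with probability one half, give a string probability of 1/32 and therefore 5 bits of information.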

We can now apply these results to determining the average information used per 
production by a grammar. Since the information conveyed by a string is the same as
the information used in its derivation, the entropy of a language measures the average 
information used by a grammar in generating a string of that language. If we also know 
the average derivation length for that grammar, that is, the average number of produc- 
tions needed to generate a string, then we know the average amount Q of information 
consumed per production. Summarizing, 

Q(G) = H(G)/D(G), 

where H ( G) = H[L(G)]. 

Theorem: Consider the following grammar for the free language on $n$ symbols:

$$F_n \to \bar{p}\varepsilon \oplus pAF_n$$

$$A \to q_1\tau_1 \oplus \cdots \oplus q_n\tau_n.$$

This grammar uses on the average

$$Q(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/(p+1) \text{ bits/production}$$

to generate a string.

Proof: Recall that

$$H(F_n) = [H(p,\bar{p}) + pH(q_1, \ldots, q_n)]/\bar{p} \text{ bits}$$

$$D(F_n) = (p+1)/\bar{p} \text{ productions},$$

and divide.

Q.E.D.

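Corollary: $F_2$ with continuation probability one half uses on the average one bit of
information per production.

Proof: Set $p = \bar{p} = q_1 = q_2 = \tfrac12$. Then,

$$
\begin{aligned}
Q(F_2) &= [H(\tfrac12,\tfrac12) + \tfrac12 H(\tfrac12,\tfrac12)]/(\tfrac12 + 1) \\
&= H(\tfrac12,\tfrac12) \\
&= 1 \text{ bit}.
\end{aligned}
$$

Q.E.D.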

This is intuitive, since this grammar must use one bit on each production, either to 
decide whether or not to continue, or to decide which symbol to generate. 

Consider $F_1$, the above grammar restricted to generate the free language on one
symbol, with continuation probability one half. The above formula says the information
consumed per production is

$$
\begin{aligned}
Q(F_1) &= [H(\tfrac12,\tfrac12) + \tfrac12 H(1)]/(\tfrac12 + 1) \\
&= 1/(3/2) \\
&= 2/3 \text{ bits/production}.
\end{aligned}
$$

This might be surprising, since it seems that with each successive symbol of the string
exactly one bit of information is being used, namely, to decide whether or not to
continue. The source of the inefficiency can be found by looking at $F_1$:

$$F_1 \to \bar{p}\varepsilon \oplus pAF_1$$

$$A \to \tau$$

There is a redundant production $A \to \tau$ that uses no information; this decreases the
average information used per production. If we eliminate this redundancy, we get the
one-production grammar

$$F_1 \to \bar{p}\varepsilon \oplus p\tau F_1.$$

It has the same entropy as the previous grammar, but a shorter average derivation
length:

$$
\begin{aligned}
D(F_1) &= 1 + pD(F_1) \\
&= 1/\bar{p}.
\end{aligned}
$$

For $p = \tfrac12$ its average derivation length is 2 productions, as opposed to 3 for the version
with the redundant rule. The information used per production is then

$$
\begin{aligned}
Q(F_1) &= H(F_1)/D(F_1) \\
&= \frac{H(p,\bar{p})/\bar{p}}{1/\bar{p}} \\
&= H(p,\bar{p}) \text{ bits/production}.
\end{aligned}
$$

In the case $p = \tfrac12$ this grammar uses one bit per production, as would be expected.
Thus we have a way of comparing the efficiencies with which grammars use information
and of determining whether grammars have useless productions.
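The comparison between the two grammars for the one-symbol free language can be reproduced with a few lines of code. In the sketch below, `info_per_production` is a hypothetical helper; both grammars share the entropy $H(p,\bar{p})/\bar{p}$ and differ only in average derivation length.

```python
from math import log2


def binary_entropy(p):
    """Shannon entropy in bits of a two-way choice with probability p."""
    return -sum(x * log2(x) for x in (p, 1 - p) if x > 0)


def info_per_production(p, redundant):
    """Bits used per production by a grammar for the one-symbol free language.

    redundant=True  : F -> (empty) | A F,  A -> t   (D = (p + 1) / (1 - p))
    redundant=False : F -> (empty) | t F            (D = 1 / (1 - p))
    """
    h = binary_entropy(p) / (1 - p)                       # entropy of the language
    d = (p + 1) / (1 - p) if redundant else 1 / (1 - p)   # average derivation length
    return h / d


print(info_per_production(0.5, redundant=True))    # about 2/3 bit per production
print(info_per_production(0.5, redundant=False))   # 1 bit per production
```

A lower value for the same language indicates that some productions are applied without any choice being made.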

6. Example Applications

In this example we illustrate the previously described techniques by their application
to a nontrivial language. Table 6 shows the context-free grammar for lambda-calculus
expressions; we have added unknown production probabilities.

TABLE 6. Annotated Grammar for Lambda Calculus

$E = q_1 I \oplus q_2\,\text{'('}\,\text{'λ'}\,I\,E\,\text{')'} \oplus q_3\,\text{'('}\,E\,E\,\text{')'}$

$I = (p_1 a \oplus p_2 b \oplus \cdots \oplus p_{36} 9)^{p+}$

where $q_1 + q_2 + q_3 = 1$ and $p_1 + p_2 + \cdots + p_{36} = 1$

We now apply $H$ to these productions and solve for the negentropy. Thus,

$$H(E) = H(q_1,q_2,q_3) + q_1H(I) + q_2H[\text{'('}\,\text{'λ'}\,I\,E\,\text{')'}] + q_3H[\text{'('}\,E\,E\,\text{')'}].$$

Tokens can be ignored in computing negentropies, so this reduces to

$$
\begin{aligned}
H(E) &= H(q_1,q_2,q_3) + q_1H(I) + q_2[H(I) + H(E)] + q_3 \cdot 2H(E) \\
&= H(q_1,q_2,q_3) + (q_1+q_2)H(I) + (q_2+2q_3)H(E).
\end{aligned}
$$

Solving for $H(E)$ we have

$$H(E) = \frac{H(q_1,q_2,q_3) + (q_1+q_2)H(I)}{1 - q_2 - 2q_3}.$$

It remains to solve for $H(I)$. We apply the formula for $H(A^{p+})$ to get

$$
\begin{aligned}
H(I) &= H[(p_1a \oplus \cdots \oplus p_{36}9)^{p+}] \\
&= [H(p,\bar{p}) + H(p_1a \oplus \cdots \oplus p_{36}9)]/\bar{p} \\
&= [H(p,\bar{p}) + H(p_1,p_2,\ldots,p_{36})]/\bar{p}.
\end{aligned}
$$

The resulting formula for the negentropy of the lambda-calculus is:

$$H = \frac{H(q_1,q_2,q_3) + (q_1+q_2)[H(p,\bar{p}) + H(p_1,\ldots,p_{36})]/\bar{p}}{1 - q_2 - 2q_3} \text{ bits}.$$

To actually compute the negentropy it is, of course, necessary to determine the
production probabilities $p$, $q_1$, $q_2$, $q_3$, $p_1$, $p_2$, ..., $p_{36}$. Since all the probabilities associated
in an alternation must add to unity, there are just 38 independent probabilities to be
determined.



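To determine these probabilities we will calculate the measurable properties of the
strings in the language: the average string length and the occurrence densities of the
tokens. The average length is:

$$
\begin{aligned}
\lambda &= \Lambda(E) \\
&= q_1\Lambda(I) + q_2\Lambda[\text{'('}\,\text{'λ'}\,I\,E\,\text{')'}] + q_3\Lambda[\text{'('}\,E\,E\,\text{')'}] \\
&= q_1\Lambda(I) + q_2[\Lambda(\text{'('}) + \Lambda(\text{'λ'}) + \Lambda(I) + \Lambda(E) + \Lambda(\text{')'})] + q_3[\Lambda(\text{'('}) + 2\Lambda(E) + \Lambda(\text{')'})] \\
&= q_1\Lambda(I) + q_2[\Lambda(I) + \lambda + 3] + q_3(2\lambda + 2) \\
&= (q_1+q_2)\Lambda(I) + (q_2+2q_3)\lambda + 3q_2 + 2q_3.
\end{aligned}
$$

Solving for $\lambda$ we have

$$\lambda = \frac{(q_1+q_2)\Lambda(I) + 3q_2 + 2q_3}{1 - q_2 - 2q_3}.$$

It remains to compute $\Lambda(I)$:

$$
\begin{aligned}
\Lambda(I) &= \Lambda[(p_1a \oplus \cdots \oplus p_{36}9)^{p+}] \\
&= \Lambda[p_1a \oplus \cdots \oplus p_{36}9]/\bar{p} \\
&= (p_1 + \cdots + p_{36})/\bar{p} \\
&= 1/\bar{p} \text{ tokens}.
\end{aligned}
$$

Substituting this result into the formula for $\lambda$ gives the average length of a string in the
lambda-calculus:

$$\lambda = \frac{(q_1+q_2)/\bar{p} + 3q_2 + 2q_3}{1 - q_2 - 2q_3} \text{ tokens}.$$

Recalling that the information density of a language is the ratio of its entropy and
average length, we have

$$\eta = \frac{H(q_1,q_2,q_3) + (q_1+q_2)[H(p,\bar{p}) + H(p_1,\ldots,p_{36})]/\bar{p}}{(q_1+q_2)/\bar{p} + 3q_2 + 2q_3} \text{ bits/token}.$$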

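It remains to compute the frequencies of occurrence of the tokens in the language.
First we compute $\varphi_{lp}$, the average number of left parentheses in a string.

$$
\begin{aligned}
\varphi_{lp} &= \Phi_{lp}(E) \\
&= q_1\Phi_{lp}(I) + q_2\Phi_{lp}[\text{'('}\,\text{'λ'}\,I\,E\,\text{')'}] + q_3\Phi_{lp}[\text{'('}\,E\,E\,\text{')'}] \\
&= q_1 \cdot 0 + q_2(1 + \varphi_{lp}) + q_3(1 + \varphi_{lp} + \varphi_{lp}) \\
&= q_2 + q_3 + (q_2+2q_3)\varphi_{lp}.
\end{aligned}
$$

Solving for $\varphi_{lp}$ we have

$$\varphi_{lp} = \frac{q_2 + q_3}{1 - q_2 - 2q_3}.$$

Clearly, $\varphi_{rp} = \varphi_{lp}$. It is also easy to compute the corresponding densities:

$$\delta_{lp} = \delta_{rp} = \varphi_{lp}/\lambda.$$

Next we compute $\varphi_\lambda$, the frequency of the token 'λ', by the same process:

$$
\begin{aligned}
\varphi_\lambda &= \Phi_\lambda(E) \\
&= q_1\Phi_\lambda(I) + q_2\Phi_\lambda[\text{'('}\,\text{'λ'}\,I\,E\,\text{')'}] + q_3\Phi_\lambda[\text{'('}\,E\,E\,\text{')'}] \\
&= q_2[\Phi_\lambda(\text{'λ'}) + \Phi_\lambda(E)] + 2q_3\Phi_\lambda(E) \\
&= q_2(1 + \varphi_\lambda) + 2q_3\varphi_\lambda.
\end{aligned}
$$

Solving for $\varphi_\lambda$ we have

$$\varphi_\lambda = \frac{q_2}{1 - q_2 - 2q_3},$$

and $\delta_\lambda = \varphi_\lambda/\lambda$.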

Finally, we solve for $\varphi_i$, the frequency of the $i$-th alphanumeric character:

$$
\begin{aligned}
\varphi_i &= \Phi_i(E) \\
&= q_1\Phi_i(I) + q_2[\Phi_i(I) + \Phi_i(E)] + 2q_3\Phi_i(E) \\
&= (q_1+q_2)\Phi_i(I) + (q_2+2q_3)\varphi_i.
\end{aligned}
$$

Solving for $\varphi_i$ we have

$$\varphi_i = \frac{(q_1+q_2)\Phi_i(I)}{1 - q_2 - 2q_3}.$$

It remains to determine $\Phi_i(I)$:

$$
\begin{aligned}
\Phi_i(I) &= \Phi_i[(p_1a \oplus \cdots \oplus p_{36}9)^{p+}] \\
&= \Phi_i[p_1a \oplus \cdots \oplus p_{36}9]/\bar{p} \\
&= [p_1\Phi_i(a) + \cdots + p_{36}\Phi_i(9)]/\bar{p} \\
&= p_i/\bar{p}.
\end{aligned}
$$

Substituting this into the equation for $\varphi_i$ yields

$$\varphi_i = \frac{(q_1+q_2)\,p_i}{(1 - q_2 - 2q_3)\,\bar{p}}.$$

As usual, $\delta_i = \varphi_i/\lambda$.

Unfortunately, the equations formed for the lambda-calculus are difficult to
solve. In practice they would probably have to be solved numerically.

We can gain some insight into these equations by considering their behavior in some
typical situations. Therefore, suppose that all the identifier characters are equally
likely:

$$p_1 = p_2 = \cdots = p_{36} = 1/36.$$

Then we have

$$\varphi_i = \frac{q_1+q_2}{36\,\bar{p}\,(1 - q_2 - 2q_3)}.$$

Now let $\mu = 1 - q_2 - 2q_3$. Then we have

$$\lambda = [(q_1+q_2)/\bar{p} + 3q_2 + 2q_3]/\mu,$$

$$\varphi_\lambda = q_2/\mu,$$

$$\varphi_{lp} = \varphi_{rp} = (q_2+q_3)/\mu,$$

$$\varphi_i = (q_1+q_2)/(36\,\bar{p}\,\mu).$$

Rewriting the first equation:

$$
\begin{aligned}
\lambda &= (q_1+q_2)/(\bar{p}\mu) + (3q_2 + 2q_3)/\mu \\
&= 36\varphi_i + (3q_2 + 2q_3)/\mu \\
&= 36\varphi_i + q_2/\mu + (2q_2 + 2q_3)/\mu \\
&= 36\varphi_i + \varphi_\lambda + 2\varphi_{lp}.
\end{aligned}
$$

If we rewrite this

$$\lambda = 36\varphi_i + \varphi_\lambda + \varphi_{lp} + \varphi_{rp},$$

then it becomes intuitive: the average length of a string is the sum of the average
frequencies of occurrence of each terminal symbol.
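These closed forms can be evaluated for particular probability values to confirm that the token frequencies add up to the average length. The sketch below assumes equally likely identifier characters, as in the text; the function name and the chosen probabilities are hypothetical, and the formulas are only meaningful when $\mu = 1 - q_2 - 2q_3 > 0$.

```python
def lambda_calculus_stats(p, q1, q2, q3, n_chars=36):
    """Average length and token frequencies for the annotated grammar.

    Assumes all n_chars identifier characters are equally likely and that
    mu = 1 - q2 - 2*q3 is positive, so that the averages are finite.
    """
    mu = 1 - q2 - 2 * q3
    if mu <= 0:
        raise ValueError("average length is not finite for these probabilities")
    p_bar = 1 - p
    lam = ((q1 + q2) / p_bar + 3 * q2 + 2 * q3) / mu   # average string length
    phi_lambda = q2 / mu                               # frequency of the lambda token
    phi_paren = (q2 + q3) / mu                         # frequency of '(' (same for ')')
    phi_char = (q1 + q2) / (n_chars * p_bar * mu)      # frequency of each character
    return lam, phi_lambda, phi_paren, phi_char


# Hypothetical probabilities: p = 1/2, (q1, q2, q3) = (1/2, 1/4, 1/4).
lam, phi_lambda, phi_paren, phi_char = lambda_calculus_stats(0.5, 0.5, 0.25, 0.25)
print(lam, 36 * phi_char + phi_lambda + 2 * phi_paren)
```

For these values both printed numbers should agree (11 tokens per string on average).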



Finally, we derive the information consumed per production by this grammar. To do
this we rewrite the definition of $I$ as the productions

$$I \to \bar{p}A \oplus pAI$$

$$A \to p_1a \oplus \cdots \oplus p_{36}9$$

and compute its average derivation length:

$$D(I) = 1 + \bar{p}D(A) + pD(A) + pD(I)$$

$$D(A) = 1$$

Therefore, $D(I) = 2 + pD(I)$, so $D(I) = 2/\bar{p}$. Next we compute $D(E)$:

$$
\begin{aligned}
D(E) &= 1 + q_1D(I) + q_2[D(I) + D(E)] + q_3[2D(E)] \\
&= 1 + 2(q_1+q_2)/\bar{p} + (q_2+2q_3)D(E).
\end{aligned}
$$

Therefore,

$$D(E) = \frac{1 + 2(q_1+q_2)/\bar{p}}{1 - q_2 - 2q_3} \text{ productions}.$$

The average information used per production is the ratio $H(E)/D(E)$, which is

$$Q = \frac{H(q_1,q_2,q_3) + (q_1+q_2)[H(p,\bar{p}) + H(p_1,\ldots,p_{36})]/\bar{p}}{1 + 2(q_1+q_2)/\bar{p}} \text{ bits/production}.$$
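The entropy, average derivation length and information per production of the lambda-calculus grammar can be evaluated together for given probabilities. The following sketch uses Shannon entropy in bits; the function name and the sample probabilities are hypothetical, and the formulas assume $1 - q_2 - 2q_3 > 0$.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * log2(x) for x in probs if x > 0)


def lambda_calculus_q(p, q1, q2, q3, char_probs):
    """Return (H(E), D(E), Q) for the annotated lambda-calculus grammar."""
    p_bar = 1 - p
    mu = 1 - q2 - 2 * q3                     # must be positive for finite averages
    h_i = (entropy([p, p_bar]) + entropy(char_probs)) / p_bar   # entropy of I
    h_e = (entropy([q1, q2, q3]) + (q1 + q2) * h_i) / mu        # entropy of E
    d_e = (1 + 2 * (q1 + q2) / p_bar) / mu                      # derivation length
    return h_e, d_e, h_e / d_e


# Hypothetical probabilities with 36 equally likely identifier characters.
h_e, d_e, q = lambda_calculus_q(0.5, 0.5, 0.25, 0.25, [1 / 36] * 36)
print(h_e, d_e, q)
```

Note that the factor $1 - q_2 - 2q_3$ cancels in the ratio, so `q` depends on it only through the requirement that it be positive.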



7. Conclusions 

We have described means for computing a number of information-theoretic properties
of languages and their grammars. These properties include, for languages, their
entropy, average string length, information density and density of occurrence for a
given token. For grammars we have shown how to compute average derivation length
and the information used by the grammar per production.

All of these techniques are based on the application of simple recursive formulas to
annotated grammars, that is, grammars annotated with production probabilities. We believe
that the same techniques can be applied to the computation of many other properties
of both grammars and other symbol systems.